[00:03:18] <mutante>	 !log restarting gerrit to increase heap_size from 20G to 32G (T225166 T222391)
[00:03:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:28] <stashbot>	 T225166: Gerrit crashed due to out of Heap - https://phabricator.wikimedia.org/T225166
[00:03:28] <stashbot>	 T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391
[00:05:15] <wikibugs>	 (CR) Dzahn: "gerrit restarted" [puppet] - https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: Dzahn)
[00:28:46] <wikibugs>	 (CR) Dzahn: [C: +2] "cosmetic reasons, it's not like we receive mail here anymore" [puppet] - https://gerrit.wikimedia.org/r/544078 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[00:29:17] <wikibugs>	 (PS2) Dzahn: exim: switch mail for RT to moscovium [puppet] - https://gerrit.wikimedia.org/r/544078 (https://phabricator.wikimedia.org/T180641)
[00:36:31] <wikibugs>	 (PS2) Dzahn: ATS/varnish: replace director for RT with moscovium [puppet] - https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641)
[00:43:40] <wikibugs>	 (PS3) Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - https://gerrit.wikimedia.org/r/522567
[00:45:11] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:45:57] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db2065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[00:51:09] <wikibugs>	 (PS1) Dzahn: cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - https://gerrit.wikimedia.org/r/545687
[00:53:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - https://gerrit.wikimedia.org/r/545687 (owner: Dzahn)
[00:58:46] <wikibugs>	 (CR) Dzahn: "finally fixed?" [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[00:59:01] <wikibugs>	 (PS2) Dzahn: cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - https://gerrit.wikimedia.org/r/545687
[01:16:35] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:16:41] <icinga-wm>	 PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:18:11] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:18:17] <icinga-wm>	 RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:02:11] <wikibugs>	 (PS1) RLazarus: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - https://gerrit.wikimedia.org/r/545689
[02:42:29] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:01:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20309576 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:02:51] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 52112 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:06:19] <wikibugs>	 (CR) BBlack: "The DHCP part only covers the first 9 hosts in Rack 15 (updated commitmsg to be more correct and explicit about it).  So no dns3001 yet." [puppet] - https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[03:06:53] <wikibugs>	 (PS3) BBlack: Basic install for new esams hosts [puppet] - https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294)
[03:08:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:11:47] <wikibugs>	 (CR) BBlack: [C: +2] basic DNS entries for new esams hosts [dns] - https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[03:19:06] <wikibugs>	 (PS4) BBlack: Basic install for new esams hosts [puppet] - https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294)
[03:37:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:46] <wikibugs>	 (PS1) BBlack: cp30[56][0-9]: add hiera/conftool data [puppet] - https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242)
[03:46:25] <vgutierrez>	 hmmm cp30[56].... that needs to be allowed on the acme-chief config
[03:49:21] <wikibugs>	 (PS1) BBlack: lvs300[567]: add public1-esams addrs [dns] - https://gerrit.wikimedia.org/r/545692 (https://phabricator.wikimedia.org/T236294)
[03:50:37] <wikibugs>	 (PS1) Vgutierrez: acme_chief: Grant new esams cp hosts access to the unified certificate [puppet] - https://gerrit.wikimedia.org/r/545693 (https://phabricator.wikimedia.org/T234803)
[03:51:22] <wikibugs>	 (CR) Vgutierrez: [C: +2] acme_chief: Grant new esams cp hosts access to the unified certificate [puppet] - https://gerrit.wikimedia.org/r/545693 (https://phabricator.wikimedia.org/T234803) (owner: Vgutierrez)
[03:55:17] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:55:38] <wikibugs>	 (PS1) BBlack: lvs300[567]: LVS puppetization [puppet] - https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294)
[03:55:57] <shdubsh>	 !log temporarily turn down accept delay on fermium - T235983
[03:56:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:56:03] <stashbot>	 T235983: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983
[03:56:13] <wikibugs>	 (CR) BBlack: [C: +2] lvs300[567]: add public1-esams addrs [dns] - https://gerrit.wikimedia.org/r/545692 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[04:00:56] <wikibugs>	 (CR) BBlack: [C: +2] Basic install for new esams hosts [puppet] - https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[04:48:21] <wikibugs>	 (PS1) Marostegui: site.pp: Remove puppet references for db1070 [puppet] - https://gerrit.wikimedia.org/r/545698 (https://phabricator.wikimedia.org/T235464)
[04:48:31] <wikibugs>	 (PS1) Marostegui: wmnet: Remove production DNS entries for db1070 [dns] - https://gerrit.wikimedia.org/r/545699 (https://phabricator.wikimedia.org/T235464)
[04:48:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[04:48:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:42] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[04:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:51:48] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove puppet references for db1070 [puppet] - https://gerrit.wikimedia.org/r/545698 (https://phabricator.wikimedia.org/T235464) (owner: Marostegui)
[04:52:22] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Remove production DNS entries for db1070 [dns] - https://gerrit.wikimedia.org/r/545699 (https://phabricator.wikimedia.org/T235464) (owner: Marostegui)
[04:55:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3315 after compression', diff saved to https://phabricator.wikimedia.org/P9460 and previous config saved to /var/cache/conftool/dbconfig/20191024-045544-marostegui.json
[04:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1089 from special slaves group and leave it with its original pooling options T223151', diff saved to https://phabricator.wikimedia.org/P9461 and previous config saved to /var/cache/conftool/dbconfig/20191024-045924-marostegui.json
[04:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:30] <stashbot>	 T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151
[04:59:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1097:3315 after compression', diff saved to https://phabricator.wikimedia.org/P9462 and previous config saved to /var/cache/conftool/dbconfig/20191024-045954-marostegui.json
[04:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:55] <wikibugs>	 (PS1) Marostegui: db2048,db2061: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/545700 (https://phabricator.wikimedia.org/T228258)
[05:07:39] <wikibugs>	 (CR) Marostegui: [C: +2] db2048,db2061: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/545700 (https://phabricator.wikimedia.org/T228258) (owner: Marostegui)
[05:18:34] <marostegui>	 !log Run analyze enwiki.revision on db2092 T223151
[05:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:39] <stashbot>	 T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151
[05:20:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1097:3315 after compression', diff saved to https://phabricator.wikimedia.org/P9463 and previous config saved to /var/cache/conftool/dbconfig/20191024-052002-marostegui.json
[05:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:48] <wikibugs>	 (PS1) Vgutierrez: install_server: Fix MAC addresses for new esams boxes [puppet] - https://gerrit.wikimedia.org/r/545701 (https://phabricator.wikimedia.org/T236294)
[06:13:18] <wikibugs>	 (CR) Vgutierrez: [C: +2] install_server: Fix MAC addresses for new esams boxes [puppet] - https://gerrit.wikimedia.org/r/545701 (https://phabricator.wikimedia.org/T236294) (owner: Vgutierrez)
[06:16:20] <wikibugs>	 (PS1) Giuseppe Lavagetto: parsoid: actually support safe restarts at deploy time. [puppet] - https://gerrit.wikimedia.org/r/545702 (https://phabricator.wikimedia.org/T236275)
[06:18:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] parsoid: actually support safe restarts at deploy time. [puppet] - https://gerrit.wikimedia.org/r/545702 (https://phabricator.wikimedia.org/T236275) (owner: Giuseppe Lavagetto)
[06:30:10] <wikibugs>	 (PS5) Giuseppe Lavagetto: discovery.yaml: add parsoid-php microservice [puppet] - https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[06:31:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] discovery.yaml: add parsoid-php microservice [puppet] - https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: Dzahn)
[06:32:42] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=parsoid-php,name=eqiad
[06:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:40] <wikibugs>	 (CR) Giuseppe Lavagetto: "> "error: Name 'parsoid-php.discovery.wmnet.': resolver plugin" [dns] - https://gerrit.wikimedia.org/r/543737 (owner: Dzahn)
[06:36:08] <wikibugs>	 (PS2) Giuseppe Lavagetto: add metafo record for parsoid-php [dns] - https://gerrit.wikimedia.org/r/543737 (owner: Dzahn)
[06:38:25] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] add metafo record for parsoid-php [dns] - https://gerrit.wikimedia.org/r/543737 (owner: Dzahn)
[06:40:01] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[06:48:25] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM; one nit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/545640 (owner: Jbond)
[06:52:51] <icinga-wm>	 PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[07:02:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM (PCC also seems fine: https://puppet-compiler.wmflabs.org/compiler1002/19035/)" [puppet] - https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: Jbond)
[07:07:07] <icinga-wm>	 RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[07:09:01] <XioNoX>	 starting to work on mr1, will have to bring it down a couple times at least for upgrades
[07:12:38] <wikibugs>	 (CR) Muehlenhoff: systemd: fixes in coredump class (2 comments) [puppet] - https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: Effie Mouzeli)
[07:15:41] <icinga-wm>	 PROBLEM - Juniper alarms on cr3-esams is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[07:16:13] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:51] <icinga-wm>	 PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100%
[07:22:02] <XioNoX>	 !log drain Telia link on cr2-esams
[07:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:57] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[07:22:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:34] <wikibugs>	 (CR) Muehlenhoff: "Confirmed; anyone who's staff in some way (contractor, reqnr, vendor) should use the @wikimedia.org in data.yaml, we're running some consi" [puppet] - https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: Cwhite)
[07:24:57] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[07:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:42] <vgutierrez>	 ok....
[07:29:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:30:05] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:30:11] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:31:17] <librenms-wmf>	 Critical Alert for device cr3-esams.wikimedia.org - Juniper alarm active
[07:38:00] <wikibugs>	 (PS1) Vgutierrez: hiera: Provide storage configuration for ats-backend on cp3055 [puppet] - https://gerrit.wikimedia.org/r/545706 (https://phabricator.wikimedia.org/T233242)
[07:42:44] <godog>	 !log bump rsyslog- topics partitions to 6 and roll-restart logstash frontends
[07:42:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:00] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Provide storage configuration for ats-backend on cp3055 [puppet] - https://gerrit.wikimedia.org/r/545706 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[07:50:29] <wikibugs>	 (PS1) Vgutierrez: install_server: Fix dns3002 FQDN [puppet] - https://gerrit.wikimedia.org/r/545709 (https://phabricator.wikimedia.org/T236217)
[07:50:57] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:51:21] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[07:51:46] <wikibugs>	 (CR) Vgutierrez: [C: +2] install_server: Fix dns3002 FQDN [puppet] - https://gerrit.wikimedia.org/r/545709 (https://phabricator.wikimedia.org/T236217) (owner: Vgutierrez)
[07:53:15] <vgutierrez>	 ^^ that icinga error is expected/known?
[07:55:49] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:55:55] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:56:07] <wikibugs>	 (CR) Volans: wmf_auto_reimage: Adjust message about waiting for puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522567 (owner: Dzahn)
[07:57:12] <wikibugs>	 (CR) ArielGlenn: "There's something amiss: https://puppet-compiler.wmflabs.org/compiler1001/19036/labstore1006.wikimedia.org/change.labstore1006.wikimedia.o" [puppet] - https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: Ottomata)
[07:57:15] <XioNoX>	 !log reboot mr1-esams
[07:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:18] <moritzm>	 the icinga error is "Could not find hostgroup matching 'asw2-esams.mgmt.esams.wmnet"
[07:57:32] <wikibugs>	 (PS1) Vgutierrez: hiera: Provide ats storage config for new esams upload hosts [puppet] - https://gerrit.wikimedia.org/r/545711 (https://phabricator.wikimedia.org/T233242)
[07:57:40] <moritzm>	 so related to the mr1 maintenance
[07:57:41] <XioNoX>	 mutante: ah, that's because it's the lldp neighbor for the new servers
[07:57:47] <XioNoX>	 moritzm: ^
[07:57:57] <XioNoX>	 and it's not in icinga yet, I can add it
[07:57:58] <moritzm>	 yep, I was just curious why icinga1001 alerted
[07:58:35] <moritzm>	 no hurry, better complete the mr1 stuff first
[07:59:07] <wikibugs>	 (CR) Ema: "The change itself looks good, but we first need to add profile::tlsproxy::envoy to role::requesttracker. I see that's commented out for no" [puppet] - https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[07:59:28] <XioNoX>	 because puppet adds icinga parent/child relationships based on lldp
[07:59:31] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (Gilles) It is indeed unusual for this to apply to specific pages of a small PDF, even moreso...
[08:00:09] <wikibugs>	 (PS1) Vgutierrez: hiera: Provide varnish storage config for new cp text hosts [puppet] - https://gerrit.wikimedia.org/r/545712 (https://phabricator.wikimedia.org/T233242)
[08:02:26] <wikibugs>	 (PS1) Ayounsi: Add asw2-esams to monitoring [puppet] - https://gerrit.wikimedia.org/r/545713
[08:02:45] <XioNoX>	 moritzm: feel free to merge if you think it's ok - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545713
[08:07:03] <wikibugs>	 (CR) Volans: [C: -1] "It seems that the same logic is shared by all wdqs cookbooks, so a DRYer approach would be to move it to a small function that accept the " (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/545673 (owner: Mathew.onipe)
[08:07:49] <wikibugs>	 (PS2) Volans: fix unused format [cookbooks] - https://gerrit.wikimedia.org/r/545672 (owner: Mathew.onipe)
[08:08:28] <moritzm>	 XioNoX: currently in the middle of something, will have a look later
[08:09:20] <wikibugs>	 (CR) Ema: [C: +1] hiera: Provide ats storage config for new esams upload hosts [puppet] - https://gerrit.wikimedia.org/r/545711 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:09:35] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/545687 (owner: Dzahn)
[08:13:05] <wikibugs>	 (CR) Ema: [C: +1] hiera: Provide varnish storage config for new cp text hosts [puppet] - https://gerrit.wikimedia.org/r/545712 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:14:05] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Provide ats storage config for new esams upload hosts [puppet] - https://gerrit.wikimedia.org/r/545711 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:14:46] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Provide varnish storage config for new cp text hosts [puppet] - https://gerrit.wikimedia.org/r/545712 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:15:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2092 for analyze table', diff saved to https://phabricator.wikimedia.org/P9465 and previous config saved to /var/cache/conftool/dbconfig/20191024-081519-marostegui.json
[08:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:32] <wikibugs>	 (CR) Ayounsi: [C: +2] Add asw2-esams to monitoring [puppet] - https://gerrit.wikimedia.org/r/545713 (owner: Ayounsi)
[08:15:45] <wikibugs>	 (CR) Ema: [C: +1] cp30[56][0-9]: add hiera/conftool data [puppet] - https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[08:16:05] <wikibugs>	 Operations, DC-Ops, decommission: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (fgiunchedi) Also a note for when the time comes: there's Prometheus data on this host that will need to be migrated onto a VM on esams' ganeti cluster once that's online
[08:17:58] <wikibugs>	 Operations, observability: Icinga last puppet run check: re-enable relaxed per-host check - https://phabricator.wikimedia.org/T236345 (Volans)
[08:18:08] <wikibugs>	 Operations, observability: Icinga last puppet run check: re-enable relaxed per-host check - https://phabricator.wikimedia.org/T236345 (Volans) p:Triage→Normal
[08:18:20] <wikibugs>	 (CR) Volans: [C: +2] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/545672 (owner: Mathew.onipe)
[08:18:31] <godog>	 !log roll restart rsyslog in ulsfo/esams/eqsin to pick up new kafka partitions
[08:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:55] <wikibugs>	 (Merged) jenkins-bot: fix unused format [cookbooks] - https://gerrit.wikimedia.org/r/545672 (owner: Mathew.onipe)
[08:19:58] <wikibugs>	 Operations, ops-esams, DNS, Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['dns3002.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reima...
[08:21:15] <XioNoX>	 moritzm: I merged it, let me know if it solves the issue
[08:21:17] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device mr1-esams.wikimedia.org recovered from Juniper alarm active
[08:22:44] <moritzm>	 Icinga will tell us in a bit :-)
[08:25:15] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (Gilles) It seems like the ghostscript command used by Thumbor outputs some errors to stdout...
[08:25:55] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (Gilles)
[08:26:40] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (Gilles) a:Gilles
[08:27:10] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (Gilles) I will try looking at this in my spare time, but can't promis...
[08:28:18] <wikibugs>	 (CR) Vgutierrez: [C: +2] cp30[56][0-9]: add hiera/conftool data [puppet] - https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[08:28:34] <wikibugs>	 (PS2) Vgutierrez: cp30[56][0-9]: add hiera/conftool data [puppet] - https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[08:28:37] <wikibugs>	 (CR) Ema: [C: +1] lvs300[567]: LVS puppetization [puppet] - https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[08:33:56] <wikibugs>	 Operations, Mail, Wikimedia-Mailing-lists: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 (colewhite) a:colewhite
[08:34:25] <wikibugs>	 (PS1) Ema: puppetboard: add certificate [puppet] - https://gerrit.wikimedia.org/r/545716 (https://phabricator.wikimedia.org/T210411)
[08:35:20] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (Elitre) In the meantime, you have all my appreciation.
[08:36:38] <godog>	 !log roll restart rsyslog in codfw/eqiad to pick up new kafka partitions
[08:36:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:01] <wikibugs>	 Operations, ops-esams, DNS, Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3002.wikimedia.org'] `  Of which those **FAILED**: ` ['dns3002.wikimedia.org'] `
[08:37:43] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[08:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:41] <wikibugs>	 (PS1) Ema: puppetboard: add TLS termination with envoy [puppet] - https://gerrit.wikimedia.org/r/545717 (https://phabricator.wikimedia.org/T210411)
[08:39:44] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[08:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:52] <wikibugs>	 (PS1) Ema: secret: dummy key for puppetboard [labs/private] - https://gerrit.wikimedia.org/r/545718 (https://phabricator.wikimedia.org/T210411)
[08:39:55] <wikibugs>	 (PS1) Gehel: Maps: remove varnish URI sanitization for maps (now done in Kartotherian) [puppet] - https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817)
[08:39:57] <wikibugs>	 (PS1) Ema: ATS: use TLS and DNS discovery to connect to puppetboard [puppet] - https://gerrit.wikimedia.org/r/545724 (https://phabricator.wikimedia.org/T210411)
[08:40:10] <wikibugs>	 Operations, Discovery-Search, observability: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (fgiunchedi) Picking this up as part of {T235891}, and to answer your question @jbond the current way is via scap + puppet
[08:40:22] <wikibugs>	 (PS2) Gehel: Maps: remove varnish URI sanitization for maps (now done in Kartotherian) [puppet] - https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817)
[08:40:43] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:41:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:41:36] <wikibugs>	 (PS1) Ema: Add puppetboard.discovery.wmnet pointing to puppetboard1001 [dns] - https://gerrit.wikimedia.org/r/545733 (https://phabricator.wikimedia.org/T210411)
[08:43:52] <wikibugs>	 (PS1) Vgutierrez: hiera: Add dns3002 to ntp_peers list [puppet] - https://gerrit.wikimedia.org/r/545744 (https://phabricator.wikimedia.org/T236217)
[08:44:37] <wikibugs>	 Operations, Discovery-Search, observability: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (fgiunchedi)
[08:44:44] <wikibugs>	 Operations, Wikimedia-Logstash, Patch-For-Review, User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (fgiunchedi)
[08:44:53] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[08:44:57] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:45:33] <icinga-wm>	 RECOVERY - Juniper alarms on cr3-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[08:45:45] <icinga-wm>	 RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms
[08:45:47] <wikibugs>	 (PS1) Muehlenhoff: The new esams cache hosts are configured to use the new NMVE setup in partman.cfg, but we also need to configure them in the late-command.sh script which performs the actual setup. [puppet] - https://gerrit.wikimedia.org/r/545752
[08:46:08] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Add dns3002 to ntp_peers list [puppet] - https://gerrit.wikimedia.org/r/545744 (https://phabricator.wikimedia.org/T236217) (owner: Vgutierrez)
[08:46:21] <wikibugs>	 (PS2) Vgutierrez: hiera: Add dns3002 to ntp_peers list [puppet] - https://gerrit.wikimedia.org/r/545744 (https://phabricator.wikimedia.org/T236217)
[08:46:31] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:46:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] The new esams cache hosts are configured to use the new NVME setup in partman.cfg, but we also need to configure them in the late-command.sh script which performs the actual setup. [puppet] - 10https://gerrit.wikimedia.org/r/545752 (owner: 10Muehlenhoff)
[08:47:46] <wikibugs>	 (03PS2) 10Muehlenhoff: Fix NVME setup for new esams caches [puppet] - 10https://gerrit.wikimedia.org/r/545752
[08:53:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:53:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:55:03] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Fix NVME setup for new esams caches [puppet] - 10https://gerrit.wikimedia.org/r/545752 (owner: 10Muehlenhoff)
[08:56:46] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['lvs3006.esams.wmnet'] ` The log can be found in `/var/log/wm...
[09:01:17] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr3-esams.wikimedia.org recovered from Juniper alarm active
[09:06:21] <wikibugs>	 (03PS3) 10Muehlenhoff: Fix NVME setup for new esams caches [puppet] - 10https://gerrit.wikimedia.org/r/545752
[09:06:30] <wikibugs>	 (03PS4) 10Jbond: puppet: clean up unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640
[09:07:13] <wikibugs>	 (03PS1) 10Gilles: Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853)
[09:07:48] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix NVME setup for new esams caches [puppet] - 10https://gerrit.wikimedia.org/r/545752 (owner: 10Muehlenhoff)
[09:08:17] <wikibugs>	 (03PS2) 10Gilles: Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853)
[09:09:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet: clean up unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 (owner: 10Jbond)
[09:09:23] <wikibugs>	 (03CR) 10Gilles: [C: 03+2] Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853) (owner: 10Gilles)
[09:10:15] <wikibugs>	 (03Merged) 10jenkins-bot: Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853) (owner: 10Gilles)
[09:12:15] <icinga-wm>	 PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_wezen.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[09:12:16] <logmsgbot>	 !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T234853 Re-enable performance perception survey on ruwiki (duration: 01m 04s)
[09:12:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:20] <stashbot>	 T234853: Performance survey died on ruwiki on Sep 26 - https://phabricator.wikimedia.org/T234853
[09:12:27] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[09:12:28] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3055.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim...
[09:12:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:45] <icinga-wm>	 PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_wezen.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[09:14:31] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[09:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:29] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[09:15:54] <godog>	 rsyslog expected btw ^
[09:15:59] <icinga-wm>	 RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[09:22:58] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3006.esams.wmnet'] `  and were **ALL** successful.
[09:23:41] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Fix asw2-esams hostgroup name [puppet] - 10https://gerrit.wikimedia.org/r/545782
[09:24:54] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] lvs300[567]: LVS puppetization [puppet] - 10https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack)
[09:25:04] <wikibugs>	 (03PS2) 10Vgutierrez: lvs300[567]: LVS puppetization [puppet] - 10https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack)
[09:26:12] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "Hostname is good, dunno about the lldp parent/child logic and implications" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez)
[09:32:09] <icinga-wm>	 PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[09:32:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "In general LGTM. Minor comments inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252) (owner: 10Alex Monk)
[09:33:22] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[09:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:27] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:23] <hashar>	 going to restart Jenkins to handle a plugin upgrade
[09:38:30] <wikibugs>	 (03PS1) 10Vgutierrez: lvs: Fix interface names for lvs300[5-7] and provide interface_tweaks [puppet] - 10https://gerrit.wikimedia.org/r/545785 (https://phabricator.wikimedia.org/T236294)
[09:39:38] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] lvs: Fix interface names for lvs300[5-7] and provide interface_tweaks [puppet] - 10https://gerrit.wikimedia.org/r/545785 (https://phabricator.wikimedia.org/T236294) (owner: 10Vgutierrez)
[09:40:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: aptrepo: add elastic 7 [puppet] - 10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854)
[09:43:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Note that package names in this case have '-oss' appended, i.e. elasticsearch-oss so puppet will need adjustments too." [puppet] - 10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi)
[09:43:13] <wikibugs>	 (03PS2) 10Vgutierrez: hiera,netops: Fix asw2-esams hostname [puppet] - 10https://gerrit.wikimedia.org/r/545782
[09:43:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi)
[09:45:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "The elasticsearch puppet classes are already adapted for the -oss package names, but will likely need some fixups to cover 7.x as well." [puppet] - 10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi)
[09:45:17] <wikibugs>	 10Operations, 10Discovery-Search, 10observability: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (10fgiunchedi) With the upgrade to Elastic 7 as far as I can tell all logstash plugins we're shipping will be included already, in other words w...
[09:47:02] <wikibugs>	 (03PS4) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277)
[09:47:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] hiera,netops: Fix asw2-esams hostname [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez)
[09:47:41] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:47:49] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:48:22] <wikibugs>	 (03CR) 10Jbond: "moritzm, thanks for the review although i decided against using $settings::localcacert as such can you give another review, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond)
[09:48:41] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:48:49] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:48:59] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:49:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond)
[09:49:09] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:49:25] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:49:47] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:49:55] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:50:46] <vgutierrez>	 uh...
[09:50:53] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:51:01] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:51:22] <godog>	 mhh looks like a spike
[09:52:01] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:52:11] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:52:21] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:52:37] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:53:01] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:53:07] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:53:31] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[09:57:32] <hashar>	 !log Restarting CI Jenkins
[09:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:19] <wikibugs>	 (03PS5) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277)
[09:58:50] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3055.esams.wmnet'] `  and were **ALL** successful.
[09:59:38] <hashar>	 !log Converting CI jobs to use the new PostBuildScript plugin config | https://gerrit.wikimedia.org/r/#/c/integration/config/+/544907/ | T188398 
[09:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:26] <wikibugs>	 (03PS6) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277)
[10:10:56] <Lucas_WMDE>	 matthiasmullie: SWAT is only in one hour, right? because I just saw you +2ed one of the backports…
[10:11:01] <Lucas_WMDE>	 or maybe I’m wrong, let’s check
[10:11:03] <Lucas_WMDE>	 jouncebot: now
[10:11:03] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 48 minute(s)
[10:11:49] <matthiasmullie>	 Lucas_WMDE: yeah - getting a head start to get the test suites to complete in time
[10:11:58] <Lucas_WMDE>	 ok
[10:12:12] <matthiasmullie>	 there's 3 patches that depend on one another, so I'll merge them in sequentially
[10:17:00] <wikibugs>	 (03PS1) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798
[10:17:40] <wikibugs>	 (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for puppetboard [labs/private] - 10https://gerrit.wikimedia.org/r/545718 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[10:17:42] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "merging to unblock icinga for now, we'll re-evaluate it later" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez)
[10:17:51] <vgutierrez>	 thx volans <3
[10:17:55] <wikibugs>	 10Operations, 10Traffic, 10observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10fgiunchedi) I've updated the frontend-traffic dashboard to include global availability correctly, and got rid of the summed value
[10:18:01] <wikibugs>	 (03CR) 10Ema: [C: 03+2] puppetboard: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545716 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[10:18:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s6 weights for db1093 and db1085', diff saved to https://phabricator.wikimedia.org/P9466 and previous config saved to /var/cache/conftool/dbconfig/20191024-101810-marostegui.json
[10:18:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:21] <wikibugs>	 (03PS2) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798
[10:20:23] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Add puppetboard.discovery.wmnet pointing to puppetboard1001 [dns] - 10https://gerrit.wikimedia.org/r/545733 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[10:20:36] <wikibugs>	 (03PS3) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798
[10:21:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "ugh, is there no way to override that? Perhaps some doc/changelog from juniper documenting this change would give us a cleaner way out?" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez)
[10:21:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798 (owner: 10Jbond)
[10:22:51] <wikibugs>	 (03CR) 10Ema: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/19040/" [puppet] - 10https://gerrit.wikimedia.org/r/545717 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[10:24:16] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles)
[10:24:24] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) a:05Gilles→03None
[10:24:30] <wikibugs>	 (03CR) 10Volans: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez)
[10:25:02] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles)
[10:25:04] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] blubberoid: Add TLS termination (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto)
[10:25:32] <wikibugs>	 (03CR) 10Ema: [C: 03+2] ATS: use TLS and DNS discovery to connect to puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/545724 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema)
[10:25:51] <icinga-wm>	 PROBLEM - Host lvs3006 is DOWN: PING CRITICAL - Packet loss = 100%
[10:26:21] <icinga-wm>	 RECOVERY - Host lvs3006 is UP: PING OK - Packet loss = 0%, RTA = 83.44 ms
[10:26:32] <vgutierrez>	 ^^ that was expected...
[10:26:46] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles)
[10:27:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) (owner: 10Giuseppe Lavagetto)
[10:27:19] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) p:05Triage→03High
[10:27:28] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) a:03Krinkle
[10:27:59] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) Assigning this to you @krinkle so you can do more digging on the ResourceLoader side of things.
[10:28:08] <wikibugs>	 (03CR) 10Ayounsi: "that's the best doc so far - https://lists.gt.net/nsp/juniper/66466" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez)
[10:28:37] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles)
[10:28:48] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles)
[10:29:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers maerlant.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:29:57] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[10:30:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "I like this approach, albeit it has one minor caveat, that it assumes all extra ports will be of a debug nature. Which is possibly fine. L" [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 (owner: 10Giuseppe Lavagetto)
[10:30:02] <wikibugs>	 10Operations, 10observability: Tune HTTP availability alerts - https://phabricator.wikimedia.org/T236367 (10fgiunchedi)
[10:36:14] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201...
[10:37:48] <wikibugs>	 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema)
[10:38:02] <wikibugs>	 (03PS1) 10Filippo Giunchedi: monitoring: alert on reduced global http availability [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367)
[10:38:31] <wikibugs>	 (03PS1) 10Jbond: jenkins: update java version for buster installs [puppet] - 10https://gerrit.wikimedia.org/r/545803
[10:39:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] monitoring: alert on reduced global http availability [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) (owner: 10Filippo Giunchedi)
[10:41:08] <wikibugs>	 (03PS2) 10Jbond: jenkins: update java version for buster installs [puppet] - 10https://gerrit.wikimedia.org/r/545803
[10:41:25] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3057.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim...
[10:41:36] <wikibugs>	 (03CR) 10Effie Mouzeli: systemd: fixes in coredump class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli)
[10:45:45] <icinga-wm>	 PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 39, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:45:59] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:47:15] <wikibugs>	 (03PS2) 10Filippo Giunchedi: monitoring: alert on reduced global http availability [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367)
[10:47:40] <hauskater>	 jouncebot: next
[10:47:40] <jouncebot>	 In 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1100)
[10:47:47] <hauskater>	 jouncebot: refresh
[10:47:49] <jouncebot>	 I refreshed my knowledge about deployments.
[10:48:57] <icinga-wm>	 RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 43, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:49:42] <wikibugs>	 (03PS1) 10Vgutierrez: fix public interface name for lvs300[5-7] [dns] - 10https://gerrit.wikimedia.org/r/545805 (https://phabricator.wikimedia.org/T236294)
[10:50:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] jenkins: update java version for buster installs [puppet] - 10https://gerrit.wikimedia.org/r/545803 (owner: 10Jbond)
[10:50:16] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] fix public interface name for lvs300[5-7] [dns] - 10https://gerrit.wikimedia.org/r/545805 (https://phabricator.wikimedia.org/T236294) (owner: 10Vgutierrez)
[10:52:27] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] `  Of which those **FAILED**: ` ['cp3060.esams.wmnet'] `
[10:52:59] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[10:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:00] <logmsgbot>	 !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:55:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:53] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:55:55] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:04] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:05] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:56:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:17] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:18] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:56:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:26] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:27] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:32] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:32] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:47] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:47] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[10:56:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:54] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[10:56:55] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:12] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[10:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1100).
[11:00:05] <jouncebot>	 matthiasmullie and hauskater: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:11] <hauskater>	 o/
[11:00:13] <Lucas_WMDE>	 o/
[11:00:14] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[11:00:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:21] <Urbanecm>	 I can SWAT today!
[11:00:29] <matthiasmullie>	 o/
[11:01:08] <Lucas_WMDE>	 matthiasmullie: are you deploying your own changes?
[11:01:12] <matthiasmullie>	 that's ok - I can do them this time :)
[11:01:15] <Lucas_WMDE>	 ok :)
[11:01:16] <matthiasmullie>	 yeah, I'll do them
[11:01:36] <Urbanecm>	 in that case, matthiasmullie please ping me once you're done
[11:02:05] <matthiasmullie>	 Urbanecm: will do!
[11:02:10] <Urbanecm>	 thanks!
[11:02:58] <wikibugs>	 (PS1) Muehlenhoff: Document database setup [software/debmonitor] - https://gerrit.wikimedia.org/r/545811
[11:04:01] <wikibugs>	 (PS1) Jbond: jenkins: also update alternatives for java [puppet] - https://gerrit.wikimedia.org/r/545812
[11:07:08] <wikibugs>	 (PS1) Ema: labweb: add certificate [puppet] - https://gerrit.wikimedia.org/r/545813 (https://phabricator.wikimedia.org/T210411)
[11:07:10] <wikibugs>	 (PS1) Ema: cp3060: add to cache::nodes [puppet] - https://gerrit.wikimedia.org/r/545814 (https://phabricator.wikimedia.org/T233242)
[11:08:04] <wikibugs>	 (CR) Ema: [C: +2] labweb: add certificate [puppet] - https://gerrit.wikimedia.org/r/545813 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[11:08:17] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[11:08:17] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[11:08:19] <wikibugs>	 (CR) Ema: [C: +2] cp3060: add to cache::nodes [puppet] - https://gerrit.wikimedia.org/r/545814 (https://phabricator.wikimedia.org/T233242) (owner: Ema)
[11:08:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:24] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[11:08:24] <logmsgbot>	 !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[11:08:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:47] <wikibugs>	 Operations, ops-eqiad, DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (Jclark-ctr) Starting Pdu Replacement
[11:09:19] <wikibugs>	 (PS1) Urbanecm: Add CAT as alias for NS_CATEGORY at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545815 (https://phabricator.wikimedia.org/T236352)
[11:09:33] <wikibugs>	 (CR) Jbond: [C: +2] jenkins: also update alternatives for java [puppet] - https://gerrit.wikimedia.org/r/545812 (owner: Jbond)
[11:09:37] <wikibugs>	 Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3059.esams.wmnet'] ` The log can be found in `...
[11:09:42] <jclark-ctr>	 starting pdu refresh  b4-eqiad https://phabricator.wikimedia.org/T227540
[11:09:58] <wikibugs>	 Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo...
[11:10:09] <wikibugs>	 Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] `  Of which those **FAILED**: ` ['cp3060.esams.wmnet'] `
[11:10:20] <wikibugs>	 Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo...
[11:13:19] <logmsgbot>	 !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Allow defining entity-type-specific PrefetchingTermLookup (duration: 01m 06s)
[11:13:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:45] <wikibugs>	 (PS1) Urbanecm: Permission changes of move-rootuserpages assignment at cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359)
[11:14:11] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:14:31] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[11:14:47] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:15:43] <logmsgbot>	 !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/WikibaseMediaInfo: Also use custom PrefetchingTermLookup in SingleEntitySourceServices (duration: 01m 01s)
[11:15:47] <wikibugs>	 (PS2) Urbanecm: Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359)
[11:15:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:49] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:16:05] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:16:09] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[11:16:35] <wikibugs>	 Operations, ops-esams, Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3057.esams.wmnet'] `  and were **ALL** successful.
[11:16:45] <matthiasmullie>	 Urbanecm: done
[11:16:59] <Urbanecm>	 matthiasmullie: thanks!
[11:17:13] <wikibugs>	 (CR) Urbanecm: [C: +2] Restrict uploads on azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545655 (https://phabricator.wikimedia.org/T236307) (owner: MarcoAurelio)
[11:17:15] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2006:9536,cp2019:9536} site=codfw tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[11:17:19] <Urbanecm>	 hauskater: +2'ed your patch
[11:17:19] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:17:25] <hauskater>	 ack
[11:17:41] <vgutierrez>	 cp3060 noise is expected
[11:17:57] <wikibugs>	 (Merged) jenkins-bot: Restrict uploads on azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545655 (https://phabricator.wikimedia.org/T236307) (owner: MarcoAurelio)
[11:18:00] <wikibugs>	 (PS1) Ayounsi: New esams knams links + tilaa oob interface rename [dns] - https://gerrit.wikimedia.org/r/545817
[11:18:23] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=cp1081:9536 site=eqiad tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[11:19:05] <icinga-wm>	 PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:19:33] <Urbanecm>	 hauskater: pulled at mwdebug1001, could you test please?
[11:19:39] <hauskater>	 sure, doing
[11:19:43] <wikibugs>	 (CR) Urbanecm: [C: +2] Add CAT as alias for NS_CATEGORY at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545815 (https://phabricator.wikimedia.org/T236352) (owner: Urbanecm)
[11:20:24] <wikibugs>	 (Merged) jenkins-bot: Add CAT as alias for NS_CATEGORY at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545815 (https://phabricator.wikimedia.org/T236352) (owner: Urbanecm)
[11:20:29] <hauskater>	 Urbanecm: lgtm, listgrouprights and sitebar upload link correctly configured
[11:20:32] <wikibugs>	 (CR) Ayounsi: [C: +2] New esams knams links + tilaa oob interface rename [dns] - https://gerrit.wikimedia.org/r/545817 (owner: Ayounsi)
[11:20:35] <hauskater>	 *sidebar
[11:20:38] <Urbanecm>	 hauskater: thanks, deploying
[11:21:25] <icinga-wm>	 PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:22:20] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: SWAT: 2d66deb: Restrict uploads on azwiki (T236307) (duration: 01m 03s)
[11:22:24] <Urbanecm>	 hauskater: synced
[11:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:31] <stashbot>	 T236307: "Fayl yüklə" (Upload file) in AzWiki's Tools' Bar to Upload Wizard in Commons - https://phabricator.wikimedia.org/T236307
[11:22:34] <hauskater>	 perfect
[11:22:36] <hauskater>	 thanks
[11:22:40] <Urbanecm>	 happy to help!
[11:23:54] <wikibugs>	 (PS3) Urbanecm: Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359)
[11:24:00] <wikibugs>	 (CR) Urbanecm: [C: +2] Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359) (owner: Urbanecm)
[11:24:47] <wikibugs>	 (Merged) jenkins-bot: Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359) (owner: Urbanecm)
[11:26:04] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e079956: Add CAT as alias for NS_CATEGORY at commonswiki (T236352) (duration: 01m 00s)
[11:26:05] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:12] <stashbot>	 T236352: Create CAT: redirect for Category: namespace on Commons - https://phabricator.wikimedia.org/T236352
[11:26:21] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[11:26:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:24] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[11:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:09] <wikibugs>	 (PS5) Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792)
[11:30:55] <icinga-wm>	 PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:30:59] <Urbanecm>	 !log Run mwscript namespaceDupes.php --wiki=commonswiki --add-prefix=FIXME --fix (T236352)
[11:31:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:13] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[11:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:27] <icinga-wm>	 PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:31:49] <icinga-wm>	 PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:33:16] <logmsgbot>	 !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[11:33:17] <icinga-wm>	 PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:33:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:30] <wikibugs>	 (PS2) Cmjohnson: Adding mgmt dns for dumpsdata1003 [dns] - https://gerrit.wikimedia.org/r/545268 (https://phabricator.wikimedia.org/T234076)
[11:33:49] <icinga-wm>	 PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:34:05] <wikibugs>	 (PS1) Vgutierrez: hiera: Add cp305[5,7,9] to cache::nodes [puppet] - https://gerrit.wikimedia.org/r/545819 (https://phabricator.wikimedia.org/T233242)
[11:34:36] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding mgmt dns for dumpsdata1003 [dns] - https://gerrit.wikimedia.org/r/545268 (https://phabricator.wikimedia.org/T234076) (owner: Cmjohnson)
[11:35:11] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 3a5cb68: Permission changes of move-rootuserpages assignment at commonswiki (T236359) (duration: 01m 00s)
[11:35:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:16] <stashbot>	 T236359: Remove the "move-rootuserpages" right from all user groups without autopatrol flag on Commons - https://phabricator.wikimedia.org/T236359
[11:35:54] <wikibugs>	 Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (Vgutierrez)
[11:36:12] <wikibugs>	 Operations, ops-esams, Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (Vgutierrez)
[11:36:45] <icinga-wm>	 PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:36:49] <wikibugs>	 (CR) Alexandros Kosiaris: "Bacula is fine, the LVM module on the other hand is not ours (there are roughly 2 patches for it from us). I am not sure if it's best to p" [puppet] - https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: Jbond)
[11:36:56] <wikibugs>	 Operations, ops-esams, DNS, Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (Vgutierrez)
[11:37:33] <icinga-wm>	 RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:37:39] <icinga-wm>	 RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:37:49] <icinga-wm>	 RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:37:59] <icinga-wm>	 RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:38:11] <icinga-wm>	 RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:38:11] <icinga-wm>	 RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:38:15] <icinga-wm>	 RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:38:19] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[11:38:37] <icinga-wm>	 RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:40:06] <Urbanecm>	 !log EU SWAT done
[11:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:38] <wikibugs>	 (PS6) Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792)
[11:42:19] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] hhvm: remove hhvm leftovers from apache configs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[11:45:02] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[11:45:37] <wikibugs>	 Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3059.esams.wmnet'] `  and were **ALL** successful.
[11:47:28] <wikibugs>	 (PS1) BBlack: esams cp nodes: size storage correctly [puppet] - https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242)
[11:47:46] <icinga-wm>	 PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:47:48] <icinga-wm>	 PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:00] <icinga-wm>	 PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:00] <icinga-wm>	 PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:04] <icinga-wm>	 PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:10] <icinga-wm>	 PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:20] <icinga-wm>	 PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:28] <icinga-wm>	 PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:28] <icinga-wm>	 PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:32] <icinga-wm>	 PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:38] <icinga-wm>	 PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:48:40] <icinga-wm>	 PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:49:14] <wikibugs>	 (CR) Vgutierrez: [C: +2] hiera: Add cp305[5,7,9] to cache::nodes [puppet] - https://gerrit.wikimedia.org/r/545819 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[11:49:51] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] puppetmnasters: use localcacert setting for CA file in apache [puppet] - https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: Jbond)
[11:49:56] <wikibugs>	 Operations, Core Platform Team, MediaWiki-ResourceLoader, Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (Gilles) By the looks of the survey dashboard, the issue might have gone away, but i...
[11:49:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] esams cp nodes: size storage correctly [puppet] - https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[11:50:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[11:50:20] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp3060 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[11:51:27] <wikibugs>	 (PS2) BBlack: esams cp nodes: size storage correctly [puppet] - https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242)
[11:51:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] puppet: manage localcacert in puppet [puppet] - https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: Jbond)
[11:52:18] <icinga-wm>	 PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[11:52:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[11:52:20] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp3060 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[11:52:50] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:53:10] <effie>	 ^ looking
[11:53:15] <bblack>	 all the varnish/ipsec stuff is "expected" while bringing up new nodes
[11:54:20] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[11:54:20] <icinga-wm>	 PROBLEM - eventlogging Varnishkafka log producer on cp3060 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[11:55:21] <wikibugs>	 (PS1) Jbond: puppet_compiler: install ruby-multi-json on buster for PUP-8715 [puppet] - https://gerrit.wikimedia.org/r/545821
[11:55:41] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (Cmjohnson)
[11:56:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:56:20] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: install ruby-multi-json on buster for PUP-8715 [puppet] - https://gerrit.wikimedia.org/r/545821 (owner: Jbond)
[11:56:24] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[11:58:24] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:58:24] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[11:58:26] <icinga-wm>	 PROBLEM - statsv Varnishkafka log producer on cp3060 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1200)
[12:00:12] <icinga-wm>	 PROBLEM - Host phab1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:00:24] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:00:24] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[12:00:28] <XioNoX>	 !log shutdown transit BGP sessions on cr2-knams
[12:00:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:00:50] <icinga-wm>	 PROBLEM - Disk space on cp3060 is CRITICAL: DISK CRITICAL - /proc/sys/fs/binfmt_misc is not accessible: No such device https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3060&var-datasource=esams+prometheus/ops
[12:01:24] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:01:36] <wikibugs>	 (PS1) Cmjohnson: Adding  production dns for dumpsdata1003 [dns] - https://gerrit.wikimedia.org/r/545823 (https://phabricator.wikimedia.org/T234076)
[12:01:44] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:02:20] <icinga-wm>	 PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:02:24] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp3060 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:02:35] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding  production dns for dumpsdata1003 [dns] - https://gerrit.wikimedia.org/r/545823 (https://phabricator.wikimedia.org/T234076) (owner: Cmjohnson)
[12:02:36] <effie>	 ok the mediawiki errors
[12:02:44] <effie>	 are for pl.wikibooks.org 
[12:02:59] <effie>	 "Maximum execution time of 180 seconds exceeded"
[12:03:15] <effie>	 on the Scribunto extension 
[12:03:20] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (Cmjohnson)
[12:03:30] <icinga-wm>	 RECOVERY - Host phab1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[12:04:24] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3060 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:06:21] <icinga-wm>	 PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:06:21] <XioNoX>	 !log shutdown cr1-esams - cr2-knams link
[12:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:58] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms
[12:07:44] <wikibugs>	 (PS3) BBlack: esams cp nodes: size storage correctly [puppet] - https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242)
[12:08:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2092 after analyze table', diff saved to https://phabricator.wikimedia.org/P9468 and previous config saved to /var/cache/conftool/dbconfig/20191024-120812-marostegui.json
[12:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:17] <marostegui>	 jynus: ^
[12:08:19] <wikibugs>	 Operations, Puppet, puppet-compiler: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (jbond) p:Triage→Low
[12:08:20] <icinga-wm>	 PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:09:10] <XioNoX>	 alright, cr2-knams will go offline very soon while I move the fibers and go replace the device
[12:09:52] <wikibugs>	 Operations, ops-eqiad, Patch-For-Review: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (Cmjohnson) a:Cmjohnson→ArielGlenn @ArielGlenn  the onsite work has been completed, I did add the production dns
[12:10:24] <icinga-wm>	 PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:10:44] <wikibugs>	 (CR) BBlack: [C: +2] esams cp nodes: size storage correctly [puppet] - https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[12:10:58] <icinga-wm>	 PROBLEM - Juniper alarms on csw2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[12:11:26] <icinga-wm>	 PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100%
[12:11:42] <wikibugs>	 (PS1) Jbond: idp: add secrets [labs/private] - https://gerrit.wikimedia.org/r/545826
[12:12:40] <icinga-wm>	 PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:40] <icinga-wm>	 PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:40] <icinga-wm>	 PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:40] <icinga-wm>	 PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:40] <icinga-wm>	 PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:41] <icinga-wm>	 PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:41] <icinga-wm>	 PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:42] <icinga-wm>	 PROBLEM - Host cp3044 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:42] <icinga-wm>	 PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:43] <icinga-wm>	 PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:43] <icinga-wm>	 PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:43] <wikibugs>	 (PS2) Jbond: idp: add secrets [labs/private] - https://gerrit.wikimedia.org/r/545826
[12:12:44] <icinga-wm>	 PROBLEM - Host cp3046 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:44] <icinga-wm>	 PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:45] <icinga-wm>	 PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:49] <bblack>	 XioNoX: ^ ?
[12:12:52] <icinga-wm>	 PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:52] <icinga-wm>	 PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:52] <icinga-wm>	 PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:52] <icinga-wm>	 PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:52] <icinga-wm>	 PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:52] <icinga-wm>	 PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:53] <icinga-wm>	 PROBLEM - Host multatuli is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:54] <volans>	 wut?
[12:12:58] <icinga-wm>	 PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:58] <icinga-wm>	 PROBLEM - Host bast3002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:58] <icinga-wm>	 PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:00] <icinga-wm>	 PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:21] <bblack>	 prepping depool patch
[12:13:27] <icinga-wm>	 PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:13:33] <marostegui>	 bblack: thanks
[12:13:45] <doctaxon>	 ticket.wikimedia.org is not reachable because of connection timed out
[12:13:45] <wikibugs>	 (PS1) BBlack: depool esams [dns] - https://gerrit.wikimedia.org/r/545827
[12:13:51] <volans>	 godog: those text-lb alerts don't have #page but they page
[12:13:53] <volans>	 FYI
[12:13:55] <marostegui>	 doctaxon: we are on it
[12:14:01] <doctaxon>	 thx
[12:14:06] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-esams.mgmt.esams.wmnet is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[12:14:11] <paladox>	 Hmm, wikipedia down for me, just stuck loading
[12:14:15] <bblack>	 working on it
[12:14:17] <volans>	 bblack: go ahead with the depool, cannot reach wikis
[12:14:18] <akosiaris>	 asw-2 issues?
[12:14:20] <marostegui>	 paladox: on it
[12:14:30] <paladox>	 Thanks
[12:14:31] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:862:ed1a::2:b and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:14:40] <bblack>	 !log depool esams in geodns
[12:14:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:48] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on upload-lb.esams.wikimedia.org is CRITICAL: connect to address 91.198.174.208 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:15:00] <jynus>	 it is back up for me
[12:15:08] <XioNoX>	 looking
[12:15:11] <marostegui>	 not for me yet
[12:15:14] <wikibugs>	 (CR) BBlack: [V: +2 C: +2] depool esams [dns] - https://gerrit.wikimedia.org/r/545827 (owner: BBlack)
[12:15:20] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:15:20] <icinga-wm>	 PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:15:22] <icinga-wm>	 RECOVERY - Host cp3042 is UP: PING WARNING - Packet loss = 28%, RTA = 83.44 ms
[12:15:22] <icinga-wm>	 RECOVERY - Host cp3032 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms
[12:15:22] <icinga-wm>	 RECOVERY - Host cp3047 is UP: PING WARNING - Packet loss = 28%, RTA = 83.39 ms
[12:15:22] <icinga-wm>	 RECOVERY - Host bast3002 is UP: PING WARNING - Packet loss = 28%, RTA = 83.47 ms
[12:15:22] <icinga-wm>	 RECOVERY - Host cp3036 is UP: PING WARNING - Packet loss = 28%, RTA = 83.41 ms
[12:15:24] <icinga-wm>	 RECOVERY - Host cp3045 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms
[12:15:24] <icinga-wm>	 RECOVERY - Host lvs3002 is UP: PING WARNING - Packet loss = 28%, RTA = 83.52 ms
[12:15:24] <icinga-wm>	 RECOVERY - Host cp3030 is UP: PING WARNING - Packet loss = 28%, RTA = 83.41 ms
[12:15:24] <icinga-wm>	 RECOVERY - Host cp3040 is UP: PING WARNING - Packet loss = 28%, RTA = 83.42 ms
[12:15:24] <icinga-wm>	 RECOVERY - Host cp3049 is UP: PING WARNING - Packet loss = 28%, RTA = 83.40 ms
[12:15:24] <icinga-wm>	 RECOVERY - Host cp3039 is UP: PING WARNING - Packet loss = 28%, RTA = 83.39 ms
[12:15:25] <icinga-wm>	 RECOVERY - Host cp3043 is UP: PING WARNING - Packet loss = 28%, RTA = 83.45 ms
[12:15:25] <icinga-wm>	 RECOVERY - Host cp3035 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms
[12:15:26] <volans>	 back for me too now,
[12:15:26] <icinga-wm>	 RECOVERY - Host cp3034 is UP: PING WARNING - Packet loss = 28%, RTA = 83.45 ms
[12:15:26] <icinga-wm>	 RECOVERY - Host cp3044 is UP: PING WARNING - Packet loss = 28%, RTA = 83.39 ms
[12:15:27] <icinga-wm>	 RECOVERY - Host cp3041 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms
[12:15:27] <icinga-wm>	 RECOVERY - Host lvs3003 is UP: PING WARNING - Packet loss = 37%, RTA = 83.50 ms
[12:15:28] <icinga-wm>	 RECOVERY - Host cp3046 is UP: PING WARNING - Packet loss = 44%, RTA = 83.42 ms
[12:15:28] <icinga-wm>	 RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 44%, RTA = 83.40 ms
[12:15:29] <icinga-wm>	 RECOVERY - Host maerlant is UP: PING WARNING - Packet loss = 44%, RTA = 83.48 ms
[12:15:29] <icinga-wm>	 RECOVERY - Host lvs3004 is UP: PING WARNING - Packet loss = 44%, RTA = 83.49 ms
[12:15:30] <icinga-wm>	 RECOVERY - Host cp3033 is UP: PING WARNING - Packet loss = 44%, RTA = 83.42 ms
[12:15:30] <icinga-wm>	 RECOVERY - Host cp3038 is UP: PING WARNING - Packet loss = 44%, RTA = 83.36 ms
[12:15:31] <icinga-wm>	 RECOVERY - Host nescio is UP: PING WARNING - Packet loss = 28%, RTA = 83.48 ms
[12:15:31] <icinga-wm>	 RECOVERY - Host multatuli is UP: PING OK - Packet loss = 0%, RTA = 83.48 ms
[12:15:33] <marostegui>	 back too now
[12:15:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_80: Servers cp3044.esams.wmnet, cp3039.esams.wmnet, cp3049.esams.wmnet, cp3047.esams.wmnet, cp3038.esams.wmnet are marked down but pooled: dns_rec_53: Servers maerlant.wikimedia.org are marked down but pooled: uploadlb6_80: Servers cp3034.esams.wmnet, cp3039.esams.wmnet, cp3045.esams.wmnet, cp3047.esams.wmnet, cp3038.esams.wmnet are marked down but pooled: uploadlb_443: Servers cp3036.esams.wmnet, cp3039.esams.wmnet, cp3034.esams.wmnet, cp3035.esams.wmnet, cp3045.esams.wmnet are marked down but pooled: uploadlb6_443: Servers cp3034.esams.wmnet, cp3044.esams.wmnet, cp3035.esams.wmnet are marked down but pooled: dns_rec6_53_udp: Servers maerlant.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:15:35] <volans>	 did a router just reboot?
[12:15:39] <icinga-wm>	 RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms
[12:15:44] <akosiaris>	 I think asw2
[12:15:45] <bblack>	 authdns-update couldn't even reach multatuli to deploy the depool
[12:15:45] <effie>	 we'll know soon 
[12:15:51] <bblack>	 but proceeding with depool until we understand
[12:15:52] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw2-esams.mgmt.esams.wmnet is OK: OK: UP: 6 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[12:15:52] <akosiaris>	 it alerted right before
[12:15:58] <bblack>	 there it goes
[12:16:15] <jynus>	 I was up on esams, apparently, not after depool
[12:16:16] <bblack>	 all authdns have the depool now
[12:16:16] <akosiaris>	 maybe PSUs or something ?
[12:16:17] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 432 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:16:21] <icinga-wm>	 RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.36 ms
[12:16:24] <_joe_>	 hey one can't upgrade his irc bouncer and you all reboot a router :D
[12:16:36] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on upload-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 419 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:16:51] <apergos>	 and at last here come the recoveries (my pages are a couple minutes behind actual)
[12:16:56] <jynus>	 I have failed over  dc on dns now
[12:17:02] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 870 bytes in 0.345 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[12:17:12] <icinga-wm>	 RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:17:18] <paladox>	 Works here now!
[12:17:22] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:17:58] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[12:18:04] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:18:06] <XioNoX>	 System booted: 2019-10-24 12:11:50 UTC (00:06:02 ago)
[12:18:48] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:18:51] <XioNoX>	 yeah, seems like the wrong power cable got unplugged :(
[12:19:36] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[12:19:44] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:20:28] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 72.85 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:23:40] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 49.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:25:20] <ema>	 !log cp3060: powercycle -- NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [charon:1226] T233242
[12:25:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:26] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[12:26:04] <wikibugs>	 (CR) Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[12:26:37] <wikibugs>	 Operations, procurement: 2x (5) C19-to-C20 power cables, 1.8m, red/blue - https://phabricator.wikimedia.org/T236377 (mark)
[12:26:48] <godog>	 volans: thanks! I'll look into it
[12:28:02] <icinga-wm>	 RECOVERY - Disk space on cp3060 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3060&var-datasource=esams+prometheus/ops
[12:28:20] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp3060 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 540502 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS
[12:28:26] <icinga-wm>	 RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:26] <icinga-wm>	 RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:26] <icinga-wm>	 RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:26] <icinga-wm>	 RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:26] <icinga-wm>	 RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:26] <icinga-wm>	 RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:26] <icinga-wm>	 RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:27] <icinga-wm>	 RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:27] <icinga-wm>	 RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:28] <icinga-wm>	 RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:28] <icinga-wm>	 RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:29] <icinga-wm>	 RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:29] <icinga-wm>	 RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:30] <icinga-wm>	 RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:30] <icinga-wm>	 RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:31] <icinga-wm>	 RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:31] <icinga-wm>	 RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[12:28:36] <icinga-wm>	 RECOVERY - HTTPS Unified RSA on cp3060 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 573244 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS
[12:29:00] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:29:15] <volans>	 godog: so of the 5 critical and 5 recovery, 3 had the page hashtag, 2 didn't, the two being the v4 and v6 alerts for the text-lb
[12:29:44] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 539 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[12:29:53] <godog>	 volans: yeah I think it is the host alerts that don't have it, as opposed to lvs service alerts
[12:30:10] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[12:30:21] <volans>	 got it
[12:30:42] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 539 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[12:31:06] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 539 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[12:31:34] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:31:36] <icinga-wm>	 RECOVERY - statsv Varnishkafka log producer on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:31:45] <wikibugs>	 Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] `  and were **ALL** successful.
[12:32:00] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[12:32:10] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3060 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:32:14] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 538 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[12:32:44] <icinga-wm>	 RECOVERY - eventlogging Varnishkafka log producer on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[12:33:00] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 19427 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:33:08] <effie>	 !log Stopping puppet on all hosts including the hhvm class (C:hhvm) - 544864 - T229792
[12:33:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:33:13] <stashbot>	 T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[12:33:28] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] idp: add secrets [labs/private] - https://gerrit.wikimedia.org/r/545826 (owner: Jbond)
[12:42:24] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: name=cp3060.esams.wmnet,service=varnish-be
[12:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:48] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:46:54] <wikibugs>	 Operations, procurement: 2x (5) C19-to-C20 power cables, 1.8m, red/blue - https://phabricator.wikimedia.org/T236377 (mark) The package just shipped and will hopefully arrive tomorrow. I created IM ticket SCTASK0120754 for shipment notification.
[12:47:06] <godog>	 volans: I thought it was just a tweak but it isn't, filed as T236379
[12:47:06] <stashbot>	 T236379: Include #page on host alerts that page SRE - https://phabricator.wikimedia.org/T236379
[12:47:28] <volans>	 ack, thx
[12:47:29] <wikibugs>	 Operations, ops-eqiad, DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (Jclark-ctr) a:Jclark-ctr→RobH @RobH Corrected cable issue on pdu
[12:47:48] <wikibugs>	 Operations, ops-eqiad, DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (Cmjohnson)
[12:48:02] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 6.475 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:48:17] <wikibugs>	 Operations, ops-eqiad, DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (Jclark-ctr) Finished pdu refresh
[12:55:34] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, just one suggestion as mentioned in the other CR" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[12:56:55] <effie>	 !log purge hhvm hhvm-luasandbox hhvm-tidy hhvm-wikidiff2 hhvm-dbg from mw* canaries  - T229792
[12:56:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:01] <stashbot>	 T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[12:59:52] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[13:00:04] <jouncebot>	 liw and brennen: Dear deployers, time to do the Mediawiki train - European Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1300).
[13:00:49] <effie>	 ^ godog, what is that ?
[13:01:56] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "Ok, I had that backwards. LGTM then" [puppet] - https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[13:02:44] <godog>	 effie: the icinga configuration errors? icinga's unhappy for some reason
[13:03:09] <effie>	 hmm
[13:03:12] <wikibugs>	 (PS1) Lars Wirzenius: all wikis to 1.35.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/545835
[13:03:15] <wikibugs>	 (CR) Lars Wirzenius: [C: +2] all wikis to 1.35.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/545835 (owner: Lars Wirzenius)
[13:04:59] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.35.0-wmf.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/545835 (owner: Lars Wirzenius)
[13:06:38] <logmsgbot>	 !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.3
[13:06:38] <wikibugs>	 Operations, MediaWiki-REST-API, Parsoid-PHP, Traffic, and 3 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (mobrovac)
[13:06:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:16] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] puppet-merge: add Repository class [puppet] - https://gerrit.wikimedia.org/r/544943 (owner: Jbond)
[13:08:44] <volans>	 effie: icinga config broken
[13:08:52] <volans>	 Error: Could not find any contact matching 'hpham-email' (config file '/etc/icinga/objects/contactgroups.cfg', starting on line 77)
[13:10:35] <godog>	 phamhi: ^ re: your contact change
[13:17:08] <wikibugs>	 (CR) MSantos: [C: +1] Maps: remove varnish URI sanitization for maps (now done in Kartotherian) [puppet] - https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817) (owner: Gehel)
[13:17:10] <wikibugs>	 (CR) Andrew Bogott: [C: +1] hhvm: remove hhvm leftovers from apache configs [puppet] - https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[13:17:50] <ema>	 !log set ats-be weights on new esams upload nodes T233242 
[13:17:50] <wikibugs>	 (CR) Andrew Bogott: [C: +1] hhvm: make all files and packages absent by default [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[13:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:55] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[13:18:20] <librenms-wmf>	 Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80%
[13:18:23] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3053.esams.wmnet
[13:18:24] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3055.esams.wmnet
[13:18:25] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3065.esams.wmnet
[13:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:26] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3057.esams.wmnet
[13:18:27] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3061.esams.wmnet
[13:18:28] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3059.esams.wmnet
[13:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:29] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3051.esams.wmnet
[13:18:30] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3063.esams.wmnet
[13:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:20:35] <wikibugs>	 (CR) Gehel: [C: +1] query_service: separate categories from main blazegraph profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:20:52] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:21:14] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] hhvm: make all files and packages absent by default [puppet] - https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[13:22:03] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp30[56].*,cluster=cache_upload,service=nginx
[13:22:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:23] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp30[56].*,cluster=cache_upload,service=varnish-fe
[13:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:43] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp30[56].*,cluster=cache_text,service=nginx
[13:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:02] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp30[56].*,cluster=cache_text,service=varnish-fe
[13:23:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:26] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/weight=100; selector: name=cp30[56].*,service=varnish-be
[13:24:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:31] <logmsgbot>	 !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=varnish-be,name=cp30[56].*
[13:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:49] <wikibugs>	 (CR) Gehel: [C: -1] "A few previous comments are still unaddressed" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: Mathew.onipe)
[13:25:09] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "Aside from the minor whitespace issue, +1 from me. And it's fine to keep the puppet structure as is." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[13:25:46] <XioNoX>	 !log enable transit4/6 on cr2-knams
[13:25:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:08] <wikibugs>	 (03PS7) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792)
[13:27:04] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe)
[13:27:43] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe)
[13:28:30] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[13:30:53] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: conftool::scripts: add a helper script to initialize a node [puppet] - 10https://gerrit.wikimedia.org/r/545838
[13:31:20] <_joe_>	 bblack, ema ^^
[13:31:28] <_joe_>	 this would allow you to define a 
[13:31:35] <wikibugs>	 (03PS1) 10Filippo Giunchedi: nagios: rename s/hpham/phamhi/ [puppet] - 10https://gerrit.wikimedia.org/r/545839
[13:31:39] <_joe_>	 initialize-varnish-be script for instance
[13:31:49] <_joe_>	 that you can just run via cumin like "pool"
[13:32:20] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10Gehel) Note that APIFeatureUsage has the ELK cluster talk to the Cirrus elasticsearch cluster directly. This means that logstash version on ELK needs to be compatible with the elasticsearc...
[13:32:57] <wikibugs>	 (03CR) 10Phamhi: [C: 03+2] nagios: rename s/hpham/phamhi/ [puppet] - 10https://gerrit.wikimedia.org/r/545839 (owner: 10Filippo Giunchedi)
[13:34:36] <effie>	 !log enable puppet on mwdebug*
[13:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:59] <wikibugs>	 (03PS1) 10BBlack: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/545845
[13:36:29] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/545845 (owner: 10BBlack)
[13:36:40] <bblack>	 !log re-pooling esams in dns
[13:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:14] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[13:46:00] <ema>	 _joe_: thanks :)
[13:46:10] <_joe_>	 ema: I was thinking
[13:46:15] <_joe_>	 you might prefer a single command
[13:46:22] <_joe_>	 that goes through all the objects
[13:46:42] <_joe_>	 if that's the case, we can rework the patch
[13:47:15] <_joe_>	 ease-of-use vs granularity
[13:48:20] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%
[13:49:22] <wikibugs>	 (03PS1) 10Ema: cp3056: add to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545852 (https://phabricator.wikimedia.org/T233242)
[13:50:16] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/lo...
[13:50:33] <wikibugs>	 (03CR) 10Ema: [C: 03+2] cp3056: add to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545852 (https://phabricator.wikimedia.org/T233242) (owner: 10Ema)
[13:52:36] <wikibugs>	 (03PS1) 10Effie Mouzeli: hhvm: fixes in removal [puppet] - 10https://gerrit.wikimedia.org/r/545854 (https://phabricator.wikimedia.org/T229792)
[13:55:06] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "LGTM++" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) (owner: 10Filippo Giunchedi)
[13:55:10] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:56:30] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:59:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] hhvm: fixes in removal [puppet] - 10https://gerrit.wikimedia.org/r/545854 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli)
[13:59:44] <ema>	 !log pool cp3060 T233242
[13:59:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:50] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[14:04:40] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 33.12 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:05:19] <ema>	 ^ this is normal, traffic in eqiad went down due to esams repool
[14:06:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: fixes in removal [puppet] - 10https://gerrit.wikimedia.org/r/545854 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli)
[14:07:30] <icinga-wm>	 PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:07:56] <icinga-wm>	 PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 48 connecting: cp3056_v4, cp3056_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[14:08:18] <wikibugs>	 (03PS1) 10Ema: Add new esams cp hosts to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545857 (https://phabricator.wikimedia.org/T233242)
[14:08:56] <icinga-wm>	 PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:08:56] <icinga-wm>	 PROBLEM - BFD status on cr2-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:10:32] <icinga-wm>	 PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:10:36] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:12:01] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] Add new esams cp hosts to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545857 (https://phabricator.wikimedia.org/T233242) (owner: 10Ema)
[14:14:02] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Add new esams cp hosts to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545857 (https://phabricator.wikimedia.org/T233242) (owner: 10Ema)
[14:15:44] <wikibugs>	 (03CR) 10CDanis: "Haven't yet looked deeply at the code, but here's a few high-level questions and TODOs" (034 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (owner: 10RLazarus)
[14:16:31] <ema>	 !log power-cycle cp3056, stuck rebooting into d-i T233242
[14:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:37] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[14:19:11] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/lo...
[14:19:57] <wikibugs>	 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Krinkle)
[14:20:13] <wikibugs>	 10Operations, 10Core Platform Team, 10Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Krinkle)
[14:20:48] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 78.33 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:21:17] <wikibugs>	 (03CR) 10CDanis: metamonitoring: add sync of Icinga contacts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[14:21:23] <wikibugs>	 10Operations, 10ops-eqiad: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10ArielGlenn) Awesome. Do you need instructions for the raid setup or is that already taken care of?
[14:22:36] <effie>	 !log enable puppet on mw app canaries 
[14:22:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:23] <bblack>	 !log lvs3006 (upload, inactive) - manual pybal med s/100/90/ (preferred to lvs3004 for fallback from lvs3002)
[14:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:11] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 16  cp3061 : xe-6/0/15 cp3062: xe-6/0/16 cp3063: xe-6/0/17 cp3064: xe-6/0/18 cp3065: xe-6/0/19
[14:24:43] <bblack>	 !lvs lvs3002 (upload, active) - manual pybal med s/0/50/ (restart will briefly flip traffic to lvs3006, will come back as primary again, but with non-zero med).
[14:26:29] <bblack>	 !log lvs3006 (upload, becoming active) - manual pybal med s/90/0/ (will take over from lvs3002, intended permanently).
[14:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:42] <godog>	 rlazarus: I saw your review passing by, re: local style for new repos I'd recommend going the 'black' route, i.e. https://phabricator.wikimedia.org/T211750#4851410
[14:30:08] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3007 switch information  xe-6/0/12
[14:31:42] <wikibugs>	 (03PS1) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792)
[14:33:24] <wikibugs>	 (03PS1) 10BBlack: esams upload lvs: 6 is primary, [24] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545863
[14:35:05] <wikibugs>	 (03CR) 10Matthias Mullie: "This change is ready for review." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek)
[14:38:01] <wikibugs>	 (03PS2) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792)
[14:38:32] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] esams upload lvs: 6 is primary, [24] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545863 (owner: 10BBlack)
[14:38:58] <wikibugs>	 (03CR) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek)
[14:39:52] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Introduce Elastic 7 support [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854)
[14:40:02] <logmsgbot>	 !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[14:40:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:09] <effie>	 !log Remove  hhvm hhvm-luasandbox hhvm-tidy hhvm-wikidiff2 hhvm-dbg from all canaries and codfw  - T229792
[14:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:16] <stashbot>	 T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[14:42:01] <logmsgbot>	 !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[14:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:12] <wikibugs>	 (03CR) 10Matthias Mullie: "This change is ready for review." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek)
[14:44:37] <addshore>	 jouncebot now
[14:44:37] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 15 minute(s)
[14:44:44] <addshore>	 jouncebot: next
[14:44:45] <jouncebot>	 In 1 hour(s) and 15 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1600)
[14:45:24] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Papaul) ganeti3003 switch information  xe-6/0/13
[14:46:33] <wikibugs>	 (03PS3) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792)
[14:47:07] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+1] "Not that I know too much about this, but LGTM! :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek)
[14:47:43] <effie>	 !log run puppet on all canaries and codfw  - T229792
[14:47:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:52] <stashbot>	 T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[14:49:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Should be enough to get the ball rolling!" [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi)
[14:50:06] <wikibugs>	 (03CR) 10Addshore: [C: 03+2] Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek)
[14:50:59] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek)
[14:51:18] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10fgiunchedi)
[14:51:39] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi)
[14:54:53] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3058.esams.wmnet'] `  and were **ALL** successful.
[14:58:05] <bblack>	 !log cr2-esams - change fallback static route for high-traffic2 to lvs3006
[14:58:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:58:34] <bblack>	 !log cr3-esams - change fallback static route for high-traffic2 to lvs3006
[14:58:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:37] <bblack>	 !log cr2-esams - add missing lvs3005 IP to bgp pybal neighbor list
[15:00:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:56] <wikibugs>	 (03PS2) 10CDanis: lvs: do not page on karthoterian unavailability [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto)
[15:04:02] <logmsgbot>	 !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: testcommonswiki, Enable Wikibase client access T223792 (duration: 00m 53s)
[15:04:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:07] <stashbot>	 T223792: Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items - https://phabricator.wikimedia.org/T223792
[15:06:32] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) >>! In T234854#5602802, @Gehel wrote: > Note that APIFeatureUsage has the ELK cluster talk to the Cirrus elasticsearch cluster directly. This means that logstash version on ELK...
[15:07:23] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] lvs: do not page on karthoterian unavailability [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto)
[15:09:37] <ema>	 !log pool cp3055 (cache_upload) T233242
[15:09:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:42] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[15:09:53] <effie>	 !log Remove hhvm packages and enable puppet across the fleet - T229792
[15:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:01] <stashbot>	 T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[15:13:31] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH)
[15:14:53] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) I have remotely setup both ps1-oe15-esams and ps1-oe16-esams with network configuration.  They have NOT had their ports labeled, as this task doesn't list what is plugged into each port.  At...
[15:15:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:16:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) (owner: 10Filippo Giunchedi)
[15:18:22] <effie>	 !log Slowly reload apache across the fleet (as we are enabling puppet) - T229792 
[15:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:27] <stashbot>	 T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[15:19:12] <ema>	 !log pool cp3058 (cache_text) T233242
[15:19:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:16] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[15:21:55] <wikibugs>	 (03PS1) 10Mholloway: Revert "lvs::monitor_services: increase number of tries before MCS is critical" [puppet] - 10https://gerrit.wikimedia.org/r/545873 (https://phabricator.wikimedia.org/T229286)
[15:22:17] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH) p:05Triage→03Normal
[15:22:32] <icinga-wm>	 RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:22:33] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH)
[15:22:43] <wikibugs>	 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10jijiki) I will take a look tomorrow, sorry for delaying this
[15:22:58] <wikibugs>	 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10jijiki) a:03jijiki
[15:24:41] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 (10Pine) p:05High→03Unbreak! Does this bug affect emergency@ or legal@? In any case if this is delaying emails to oversighters then I think that UBN prio...
[15:25:14] <wikibugs>	 (03PS1) 10CDanis: swift alerts: check over https when appropriate [puppet] - 10https://gerrit.wikimedia.org/r/545874
[15:25:19] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10Aklapper)
[15:26:40] <rlazarus>	 godog: thanks! will look
[15:27:07] <bblack>	 !log asw2-esams: configure port descriptions and vlan/lvs groupings for all rack16 hosts (lvs3007, ganeti3003, bast3004, cp3061-5)
[15:27:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:10] <wikibugs>	 (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/compiler1002/19048/" [puppet] - 10https://gerrit.wikimedia.org/r/545874 (owner: 10CDanis)
[15:29:07] <wikibugs>	 (03PS1) 10Phamhi: icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942)
[15:29:08] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10ssingh) This bug is also affecting the https://lists.wikimedia.org/mailman/listinfo/traffic-anomaly-report list though it is not a priorit...
[15:30:24] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 42.07 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:30:58] <wikibugs>	 (03PS2) 10Phamhi: icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942)
[15:33:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942) (owner: 10Phamhi)
[15:33:56] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 41.64 le 60 Ema Brief traffic spike on eqsin varnish-fe, nothing to worry about https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:34:03] <wikibugs>	 (03CR) 10Phamhi: [C: 03+2] icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942) (owner: 10Phamhi)
[15:34:18] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 82.89 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:35:40] <wikibugs>	 (03PS1) 10BBlack: esams: mgmt dns for rack 16 [dns] - 10https://gerrit.wikimedia.org/r/545880 (https://phabricator.wikimedia.org/T236294)
[15:37:39] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] esams: mgmt dns for rack 16 [dns] - 10https://gerrit.wikimedia.org/r/545880 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack)
[15:40:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:40:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545874 (owner: 10CDanis)
[15:40:46] <ema>	 !log depool cp3030 (cache_text) T233242
[15:40:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:59] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[15:42:20] <icinga-wm>	 PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[15:45:42] <ema>	 !log depool cp3034 (cache_upload) T233242
[15:45:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:28] <icinga-wm>	 PROBLEM - Check systemd state on mw1270 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:51:21] <ema>	 !log depool cp3032 (cache_text) T233242
[15:51:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:25] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[15:52:41] <wikibugs>	 10Operations, 10Wikimedia-production-error: mwdebug1001 throws "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" - https://phabricator.wikimedia.org/T236401 (10Urbanecm)
[15:57:31] <wikibugs>	 10Operations, 10Phabricator, 10hardware-requests, 10serviceops, 10Release-Engineering-Team (Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (10mark) I'm a bit confused; as far as I know the...
[15:58:53] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Note the difference: you are using quotation marks in your examples, while there are none in the script." [puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac)
[15:59:23] <wikibugs>	 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki)
[15:59:37] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: helmfile_log_sal: Fix getting the user and host for logging [puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac)
[16:00:05] <jouncebot>	 godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1600).
[16:00:05] <jouncebot>	 urandom: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Sorry for taking so long to merge this. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac)
[16:00:34] <urandom>	 ooo, a sticker.  tempting.
[16:01:02] <wikibugs>	 (03CR) 10RLazarus: "One quick reply, more to come shortly." (031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (owner: 10RLazarus)
[16:02:40] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10Performance-Team (Radar): Investigate recurrent GET latency spikes on MediaWiki appservers (Oct 16) - https://phabricator.wikimedia.org/T235872 (10Krinkle)
[16:03:07] <wikibugs>	 10Operations, 10serviceops, 10Performance-Team (Radar): Increased POST latency for MW app servers (Oct 2019) - https://phabricator.wikimedia.org/T235755 (10Krinkle)
[16:07:05] <ema>	 !log pool cp3057 (cache_upload) T233242
[16:07:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:10] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[16:14:21] <wikibugs>	 (03PS2) 10Ottomata: Include hadoop client packages and config on dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229)
[16:15:44] <wikibugs>	 (03CR) 10Ottomata: "Seems good!" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata)
[16:17:36] <wikibugs>	 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10MarcoAurelio)
[16:19:32] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+2] Sort debian/sha256sums explicitely [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/543187 (owner: 10DCausse)
[16:20:13] <cdanis>	 urandom: were you expecting this patch to be swatted?
[16:20:22] <cdanis>	 I still can't tell if anyone actually uses puppet swat
[16:20:38] <urandom>	 cdanis: yeah
[16:20:54] <urandom>	 I mean, I was trying to avoid the process of pestering people
[16:21:27] <urandom>	 but if swat comes and goes, I'll just revert to pestering, and weaponize this to justify why I am :)
[16:21:34] <cdanis>	 ahah
[16:22:33] <wikibugs>	 (03PS1) 10MarcoAurelio: (WIP) Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389)
[16:22:35] <cdanis>	 urandom: I'm fairly unfamiliar with Cassandra and our use of it, but you aren't, so if you expect this to be non-impactful, and to not require any special handholding (like disabling Puppet on some fraction of the fleet and then slowly re-enabling), I'm happy to +2 and merge
[16:23:36] <urandom>	 cdanis: most of this is non-normative, changes to comments and the like, meant to manage the signal-to-noise when comparing changes on the next upgrade
[16:23:39] <wikibugs>	 (03CR) 10EBernhardson: [C: 03+2] Bump experimental-highlighter to 6.5.4.1 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/543188 (https://phabricator.wikimedia.org/T236123) (owner: 10DCausse)
[16:23:48] <urandom>	 there are a couple of changes to defaults that seem harmless
[16:24:11] <wikibugs>	 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo)
[16:24:17] <wikibugs>	 (03PS7) 10CDanis: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans)
[16:24:20] <wikibugs>	 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) p:05Triage→03High
[16:24:30] <urandom>	 cdanis: I was going to restart a few nodes as canaries immediately, holler for a revert if there are issues, and do a rolling restart later if things look good
[16:24:38] <cdanis>	 urandom: sounds good
[16:25:17] <wikibugs>	 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo)
[16:25:28] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans)
[16:25:46] <wikibugs>	 (03PS2) 10MarcoAurelio: (WIP) Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389)
[16:25:51] <cdanis>	 urandom: puppet-merged
[16:26:27] <wikibugs>	 (03PS3) 10MarcoAurelio: Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389)
[16:26:45] <wikibugs>	 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10akosiaris)
[16:27:16] <wikibugs>	 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo)
[16:28:13] <ema>	 !log depool cp3035 (cache_upload) T233242
[16:28:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:20] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[16:28:27] <urandom>	 cdanis: awesome; thanks!
[16:29:06] <wikibugs>	 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) So because buster clients and jessie storage daemons cannot talk to each other, we will have to slightly alter the upgrade strategy. Several opt...
[16:30:26] <wikibugs>	 (03PS1) 10MarcoAurelio: (WIP) mediawiki::web:prod_sites.pp: Apache config for ka.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389)
[16:32:02] <wikibugs>	 (03PS2) 10Krinkle: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie)
[16:32:34] <wikibugs>	 (03PS3) 10Krinkle: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie)
[16:32:44] <wikibugs>	 (03CR) 10Krinkle: "End date still TBD as it depends on when memc is purged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie)
[16:32:57] <urandom>	 !log restarting cassandra, restbase1016 (canary for config changes) -- T200803
[16:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:02] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[16:34:15] <wikibugs>	 (03PS3) 10Mobrovac: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898)
[16:34:37] <wikibugs>	 (03PS2) 10MarcoAurelio: mediawiki::web:prod_sites.pp: Apache config for ka.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389)
[16:35:43] <wikibugs>	 (03CR) 10Jcrespo: "Yes, I have yet to apply a couple of fixes that both Filippo and Alex mentioned before merging." [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo)
[16:37:50] <hauskater>	 cdanis: hi. I wanted to test a puppet change with PCC but I'm not sure which host(s) I should choose. Could you help me?
[16:38:09] <cdanis>	 hauskater: sure, link me the change?
[16:38:29] <hauskater>	 cdanis: thank you, it's https://gerrit.wikimedia.org/r/545889
[16:38:49] <hauskater>	 I looked in site.pp for Apache but found nothing
[16:39:05] <wikibugs>	 (03CR) 10Subramanya Sastry: Parsoid/PHP: Load the extension on all Parsoid nodes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac)
[16:39:08] <hauskater>	 and I'd rather not have the change tested on all hosts, that takes lots of time
[16:39:12] <urandom>	 !log restarting cassandra, restbase2011 (canary for config changes) -- T200803
[16:39:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:17] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[16:40:24] <cdanis>	 hauskater: ah, yeah, Apache is not referenced as a role, it's just included from other configs.  you should be looking at appservers for that one, I think -- pick a couple machines that are mediawiki::appserver, and a couple mediawiki::appserver::api
[16:40:47] <hauskater>	 cdanis: alright, I'll pick some of those
[16:42:15] <hauskater>	 cdanis: mwdebug won't do the trick right?
[16:42:23] <cdanis>	 mwdebug will work as well
[16:42:32] <hauskater>	 I'll test that then
[16:44:53] <wikibugs>	 (03PS3) 10MarcoAurelio: mediawiki::web:prod_sites.pp: Apache config for ka.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389)
[16:45:02] <wikibugs>	 (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[16:45:08] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10fdans) p:05Triage→03High
[16:45:43] <wikibugs>	 (03PS1) 10BBlack: Add dhcp macaddrs for esams rack 16 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545893 (https://phabricator.wikimedia.org/T236294)
[16:48:04] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Add dhcp macaddrs for esams rack 16 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545893 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack)
[16:48:33] <hauskater>	 cdanis: that worked: https://puppet-compiler.wmflabs.org/compiler1001/321/ :)
[16:54:14] <ema>	 !log depool cp3036 (cache_upload) T233242
[16:54:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:20] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[16:55:04] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3007.esams.wmnet'] ` The log can be found in `/var/log/wmf-au...
[16:55:17] <librenms-wmf>	 04Critical Alert for device asw2-esams.mgmt.esams.wmnet - Juniper alarm active
[16:55:32] <bblack>	 I wish that alert was more-informative
[16:56:37] <bblack>	 XioNoX: ^
[16:56:46] <AntiComposite>	 "It's broken, please fix"
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1700).
[17:00:41] <subbu>	 no parsoid deploy today
[17:01:37] <XioNoX>	 bblack: that's a known one, asw2 doesn't have a 2nd power connected
[17:02:05] <wikibugs>	 (03CR) 10Subramanya Sastry: Parsoid/PHP: Load the extension on all Parsoid nodes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac)
[17:02:21] <XioNoX>	 I can't just mute that alarm without muting everything on the device :(
[17:02:25] <bblack>	 ok
[17:04:41] <wikibugs>	 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10Dzahn)
[17:05:58] <wikibugs>	 (03PS1) 10BBlack: esams mgmt dns for rack 14 [dns] - 10https://gerrit.wikimedia.org/r/545895 (https://phabricator.wikimedia.org/T236294)
[17:06:11] <wikibugs>	 (03PS5) 10Aaron Schulz: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967
[17:06:28] <wikibugs>	 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` bast3004.wikimedia.org ` The log can be found in `/var/log/wmf-auto-reimage/201910241705_dzahn_24181_bast3004_wikimedia_org.log`.
[17:06:30] <wikibugs>	 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['bast3004.wikimedia.org'] `  Of which those **FAILED**: ` ['bast3004.wikimedia.org'] `
[17:08:06] <wikibugs>	 (03PS2) 10BBlack: esams mgmt dns for rack 14 [dns] - 10https://gerrit.wikimedia.org/r/545895 (https://phabricator.wikimedia.org/T236294)
[17:08:17] <wikibugs>	 (03CR) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek)
[17:09:10] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] esams mgmt dns for rack 14 [dns] - 10https://gerrit.wikimedia.org/r/545895 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack)
[17:09:32] <wikibugs>	 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` bast3004.wikimedia.org ` The log can be found in `/var/log/wmf-auto-reimage/201910241709_dzahn_24962_bast3004_wikimedia_org.log`.
[17:10:34] <wikibugs>	 (03PS4) 10Mobrovac: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898)
[17:12:52] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] "Ok, lets go with this ... no need to be ultra paranoid about the linter flag in scenarios where we get $wgReadOnly set to false .. we will" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac)
[17:15:57] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[17:16:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:37] <wikibugs>	 (03CR) 10Mobrovac: [C: 03+2] Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac)
[17:18:48] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[17:18:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:36] <wikibugs>	 (03Merged) 10jenkins-bot: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac)
[17:24:18] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond)
[17:26:00] <logmsgbot>	 !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable Parsoid/PHP in the whole wtp (a.k.a. Parsoid) cluster - T236388 (duration: 00m 53s)
[17:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:05] <stashbot>	 T236388: Linting is disabled on beta cluster, but needs to be enabled - https://phabricator.wikimedia.org/T236388
[17:27:45] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] `  and were **ALL** successful.
[17:29:27] <bblack>	 !log asw2-esams - committing switch port/vlan config for new rack 14 hosts
[17:29:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:25] <wikibugs>	 (03PS1) 10Matthias Mullie: Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792)
[17:35:06] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) merging in duplicate ticket T236409 where i started OS install
[17:35:32] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[17:35:33] <wikibugs>	 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10Dzahn)
[17:35:55] <wikibugs>	 (03CR) 10WMDE-leszek: [C: 03+1] Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie)
[17:36:38] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[17:37:25] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) confirmed mgmt and production DNS exist, mgmt password is set, IPMI over LAN working; started OS install
[17:38:38] <ema>	 !log pool cp3059 (cache_upload) T233242
[17:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:43] <stashbot>	 T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242
[17:39:55] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[17:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:58] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:42:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:00] <wikibugs>	 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['bast3004.wikimedia.org'] `  and were **ALL** successful.
[17:58:13] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ` [bast3004:~] $ gen_fingerprints  +---------+---------+-----------------------------------------------------+  | Cipher  | Algo    | Fingerprint                                         |...
[18:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1800).
[18:00:04] <jouncebot>	 awight, urandom, and matthiasmullie: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:12] * addshore is watching
[18:00:16] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[18:00:25] <awight>	 Present :)
[18:00:28] <icinga-wm>	 PROBLEM - Host ps1-b4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[18:00:28] <Urbanecm>	 I can SWAT today!
[18:00:34] <matthiasmullie>	 o/
[18:00:50] <awight>	 Urbanecm: thank you!
[18:01:05] <Urbanecm>	 awight: is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/545260 a beta-only patch?
[18:01:24] <awight>	 Urbanecm: No, it's a beta feature for production.
[18:01:34] <Urbanecm>	 ok, a different beta then :)
[18:01:45] <awight>	 We should call the beta cluster the "alpha" cluster or something :-)
[18:01:54] <wikibugs>	 (03PS2) 10Urbanecm: Reference Previews: full beta deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545260 (https://phabricator.wikimedia.org/T235083) (owner: 10Awight)
[18:02:09] <Urbanecm>	 well, even our prod wikis run on official alpha versions of MW :)
[18:02:14] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545260 (https://phabricator.wikimedia.org/T235083) (owner: 10Awight)
[18:02:18] <awight>	  /o\ good point
[18:02:33] <urandom>	 o/
[18:02:40] <Urbanecm>	 hi urandom 
[18:02:46] <Urbanecm>	 your patch will follow after awight 's
[18:03:03] <urandom>	 Urbanecm: k
[18:03:03] <wikibugs>	 (03PS1) 10BBlack: esams: macaddrs for all new rack 14 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545908 (https://phabricator.wikimedia.org/T236294)
[18:03:07] <wikibugs>	 (03Merged) 10jenkins-bot: Reference Previews: full beta deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545260 (https://phabricator.wikimedia.org/T235083) (owner: 10Awight)
[18:03:30] <Urbanecm>	 awight: could you check your patch at mwdebug1001, please?
[18:03:38] <awight>	 Urbanecm: checking...
[18:03:39] <robh>	 !log setting ip info for ps1-a6-eqiad, it is rebooting. T227142
[18:03:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:44] <stashbot>	 T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142
[18:04:18] <wikibugs>	 (03PS6) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074)
[18:04:40] <awight>	 Urbanecm: It's healthy, ready for deployment.
[18:04:40] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] esams: macaddrs for all new rack 14 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545908 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack)
[18:04:57] <wikibugs>	 (03PS4) 10Urbanecm: rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[18:05:07] <Urbanecm>	 good, syncing
[18:05:13] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[18:05:26] <icinga-wm>	 RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms
[18:05:43] <wikibugs>	 (03PS1) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389)
[18:05:56] <wikibugs>	 (03Merged) 10jenkins-bot: rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans)
[18:05:59] <wikibugs>	 (03CR) 10Volans: "Thanks for the review, replies inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[18:06:17] <wikibugs>	 (03PS2) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689
[18:06:22] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[18:06:29] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: b20d6de: Reference Previews: full beta deployment (T235083) (duration: 00m 52s)
[18:06:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:36] <stashbot>	 T235083: Full Beta for the ReferencePreviews feature - https://phabricator.wikimedia.org/T235083
[18:06:39] <Urbanecm>	 hauskater: thanks for doing the initial config thingie!
[18:06:52] <hauskater>	 I was bored and had a bit of time :)
[18:07:12] <Urbanecm>	 urandom: if possible, please test your patch at mwdebug1001
[18:07:41] <wikibugs>	 (03PS2) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389)
[18:07:57] <wikibugs>	 (03CR) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. (033 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (owner: 10RLazarus)
[18:08:02] <icinga-wm>	 PROBLEM - ps1-a6-eqiad-infeed-load-tower-A-phase-X on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:08:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[18:08:21] <urandom>	 Urbanecm: hrmm, I'm not sure I know how to do that
[18:08:41] <urandom>	 🤔
[18:08:48] <icinga-wm>	 PROBLEM - ps1-a6-eqiad-infeed-load-tower-B-phase-Z on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:11] <Urbanecm>	 urandom: I'd simply use something that needs the service to ensure it doesn't fail
[18:09:18] <icinga-wm>	 PROBLEM - ps1-a6-eqiad-infeed-load-tower-A-phase-Z on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:18] <icinga-wm>	 PROBLEM - ps1-a6-eqiad-infeed-load-tower-B-phase-X on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:18] <icinga-wm>	 PROBLEM - ps1-a6-eqiad-infeed-load-tower-A-phase-Y on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:24] <icinga-wm>	 PROBLEM - ps1-a6-eqiad-infeed-load-tower-B-phase-Y on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:29] <Urbanecm>	 or I can just sync and verify that in fatalmonitor
[18:09:30] <urandom>	 Urbanecm: can testwiki be accessed there?
[18:09:36] <wikibugs>	 (03PS3) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389)
[18:09:53] <Urbanecm>	 urandom: yes, it's a staging production server, see https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug for the manual
[18:10:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[18:11:21] <wikibugs>	 (03PS4) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389)
[18:11:26] <wikibugs>	 10Operations, 10Wikimedia-production-error: mwdebug1001 throws "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" - https://phabricator.wikimedia.org/T236401 (10Krinkle)
[18:11:32] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] metamonitoring: add sync of Icinga contacts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans)
[18:11:41] <hauskater>	 always a comma
[18:11:50] <wikibugs>	 (03PS3) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689
[18:12:13] <wikibugs>	 10Operations, 10Wikimedia-production-error: mwdebug1001 throws "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" - https://phabricator.wikimedia.org/T236401 (10Krinkle)
[18:12:27] <wikibugs>	 (03PS5) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389)
[18:12:39] <wikibugs>	 10Operations, 10Wikimedia-production-error: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (10Krinkle)
[18:13:10] <urandom>	 Urbanecm: perfect
[18:13:19] <wikibugs>	 10Operations, 10Wikimedia-production-error: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (10Krinkle) Does not affect production (debug server) and does not seem to affect MediaWiki error logging, either. I...
[18:13:43] <urandom>	 Urbanecm: looks good
[18:13:55] <Urbanecm>	 urandom: syncing then
[18:15:35] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/: SWAT: 84c48df: rename service definition (T222851) (duration: 00m 53s)
[18:15:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:15:42] <stashbot>	 T222851: Improve Echo seentime code for multi-DC access - https://phabricator.wikimedia.org/T222851
[18:15:45] <Urbanecm>	 urandom: synced
[18:15:57] <wikibugs>	 (03PS2) 10Urbanecm: Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie)
[18:16:16] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie)
[18:16:29] <Urbanecm>	 matthiasmullie: +2'ed your patch, waiting for CI
[18:16:34] <urandom>	 Urbanecm: thanks!
[18:16:39] <Urbanecm>	 yw urandom 
[18:16:46] <matthiasmullie>	 Urbanecm: cool thanks
[18:17:03] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie)
[18:17:40] <Urbanecm>	 matthiasmullie: your patch is at mwdebug1001, could you test please?
[18:18:00] <matthiasmullie>	 sure
[18:18:05] <Urbanecm>	 thanks
[18:20:04] <robh>	 !log ps1-a6-eqiad setup complete, icinga errors should clear up T227142
[18:20:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:08] <stashbot>	 T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142
[18:20:24] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10RobH) 05Open→03Resolved
[18:20:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH)
[18:21:02] <icinga-wm>	 RECOVERY - ps1-a6-eqiad-infeed-load-tower-B-phase-X on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-B-phase-X 393 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:02] <icinga-wm>	 RECOVERY - ps1-a6-eqiad-infeed-load-tower-A-phase-Z on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-A-phase-Z 412 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:02] <icinga-wm>	 RECOVERY - ps1-a6-eqiad-infeed-load-tower-A-phase-Y on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-A-phase-Y 367 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:04] <icinga-wm>	 RECOVERY - ps1-a6-eqiad-infeed-load-tower-B-phase-Z on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-B-phase-Z 339 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:04] <icinga-wm>	 RECOVERY - ps1-a6-eqiad-infeed-load-tower-B-phase-Y on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-B-phase-Y 414 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:38] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) a:05Cmjohnson→03RobH
[18:21:56] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs3007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[18:22:05] <robh>	 !log completing ps1-b6-eqiad setup, pdu will reboot twice, power output unaffected T227540
[18:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:09] <stashbot>	 T227540: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540
[18:23:02] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[18:23:04] <wikibugs>	 10Operations, 10DC-Ops, 10decommission: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn)
[18:23:06] <robh>	 There goes one of the restarts
[18:23:15] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3005.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191024182...
[18:23:19] <robh>	 let's see how quickly librenms detects it, I assume after it's back online
[18:23:34] <wikibugs>	 (03PS1) 10Dzahn: site: replace bast3002 with bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/545911 (https://phabricator.wikimedia.org/T236329)
[18:23:56] <matthiasmullie>	 Urbanecm: well... it's not working, but at least nothing else seems broken ^^
[18:24:20] <wikibugs>	 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns3001.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reimage/2...
[18:24:29] <Urbanecm>	 matthiasmullie: I'm stupid, I didn't actually pull it :(
[18:24:47] <Urbanecm>	 matthiasmullie: could you try once more please?
[18:24:48] <matthiasmullie>	 haha :D
[18:25:11] <matthiasmullie>	 sure :)
[18:25:14] <Urbanecm>	 thanks
[18:25:21] <urandom>	 !log restbase cassandra rolling restart, rack 'a' -- T200803
[18:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:27] <wikibugs>	 (03CR) 10Nuria: [C: 04-1] "Got it, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[18:25:29] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[18:25:45] <matthiasmullie>	 Urbanecm: seems to work now :) give me 2 min to verify nothing else broke!
[18:25:51] <Urbanecm>	 sure
[18:25:52] <wikibugs>	 (03PS1) 10BBlack: Add dns3001 to ntp peers list [puppet] - 10https://gerrit.wikimedia.org/r/545912 (https://phabricator.wikimedia.org/T236217)
[18:26:22] <icinga-wm>	 RECOVERY - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[18:26:31] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Add dns3001 to ntp peers list [puppet] - 10https://gerrit.wikimedia.org/r/545912 (https://phabricator.wikimedia.org/T236217) (owner: 10BBlack)
[18:28:15] <librenms-wmf>	 04Critical Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - Device rebooted
[18:28:35] <matthiasmullie>	 Urbanecm: LGTM - let's go :)
[18:28:38] <icinga-wm>	 RECOVERY - Host ps1-b4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms
[18:28:42] <Urbanecm>	 matthiasmullie: syncing
[18:28:44] <wikibugs>	 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) a:03Dzahn @fgiunchedi ACK, so no data needs to be copied from bast3002 to bast3004, the new bastion? Instead it moves to a VM?
[18:28:46] <icinga-wm>	 RECOVERY - ps1-a6-eqiad-infeed-load-tower-A-phase-X on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-A-phase-X 378 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:29:42] <icinga-wm>	 PROBLEM - ps1-b4-eqiad-infeed-load-tower-A-phase-Z on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:29:58] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 263fd0f: Enable Wikibase client access on commonswiki (T223792) (duration: 00m 52s)
[18:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:03] <stashbot>	 T223792: Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items - https://phabricator.wikimedia.org/T223792
[18:30:37] <Urbanecm>	 matthiasmullie: done!
[18:30:40] <bblack>	 !log cr2-esams: add dns3001 to anycast4 neighbors
[18:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:48] <matthiasmullie>	 Urbanecm: thanks!
[18:30:51] <Urbanecm>	 yw
[18:31:05] <bblack>	 !log cr3-esams: add dns3001 to anycast4 neighbors
[18:31:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:16] <icinga-wm>	 PROBLEM - ps1-b4-eqiad-infeed-load-tower-A-phase-Y on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:31:20] <icinga-wm>	 PROBLEM - ps1-b4-eqiad-infeed-load-tower-B-phase-Y on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:31:36] <icinga-wm>	 PROBLEM - ps1-b4-eqiad-infeed-load-tower-A-phase-X on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:31:55] <cdanis>	 ugh the pdu alerts are noisy
[18:31:58] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[18:32:06] <icinga-wm>	 PROBLEM - ps1-b4-eqiad-infeed-load-tower-B-phase-X on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:34:00] <Urbanecm>	 matthiasmullie: I saw a spike of errors like https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.10.24/mediawiki/?id=AW3_C37YghP2xm4v6HnC
[18:34:02] <Urbanecm>	 please have a look
[18:36:00] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3061.esams.wmnet', 'cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3064.e...
[18:36:03] <matthiasmullie>	 checking
[18:36:03] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3053.e...
[18:36:21] <matthiasmullie>	 ongoing? or 1 temporary spike
[18:36:28] <wikibugs>	 (03PS1) 10RobH: setting sentry4 for ps1-b4-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/545915 (https://phabricator.wikimedia.org/T227540)
[18:36:42] <Urbanecm>	 matthiasmullie: see https://logstash.wikimedia.org/goto/10c278413af17751ff822d2388bb8a2a
[18:36:48] <wikibugs>	 (03PS1) 10Mobrovac: RESTRouter: Add ka.wm.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389)
[18:37:12] <Urbanecm>	 it seems to be ongoing
[18:37:34] <wikibugs>	 (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: Add ka.wm.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac)
[18:38:22] <matthiasmullie>	 seems to be ongoing, yes
[18:38:22] <bblack>	 mobrovac: ka or kl ?
[18:38:26] <wikibugs>	 (03PS3) 10Mathew.onipe: wdqs: Use a DRYer approach to check selected hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/545673
[18:38:28] <wikibugs>	 (03PS12) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588)
[18:38:49] <wikibugs>	 (03CR) 10RobH: [C: 03+2] setting sentry4 for ps1-b4-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/545915 (https://phabricator.wikimedia.org/T227540) (owner: 10RobH)
[18:38:59] <Urbanecm>	 matthiasmullie: looking at today's version of the same search, it may well be just a coincidence - there was a similar spike at around 10am
[18:39:13] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn)
[18:39:17] <matthiasmullie>	 also odd: it worked on mwdebug1001 (and still does, also on mwdebug2001) - it doesn't work on whatever random other server I'm on
[18:39:38] <Urbanecm>	 matthiasmullie: then it's probably cache
[18:40:00] <Urbanecm>	 try refreshing several times with ctrl+shift+r or from an incognito window
[18:40:01] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn)
[18:40:18] <wikibugs>	 (03CR) 10BBlack: RESTRouter: Add ka.wm.org (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac)
[18:40:50] <matthiasmullie>	 Urbanecm: you ok with giving it 10 more minutes or so, and reverting if things don't settle down then?
[18:41:00] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3001.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241840_dzahn_4514...
[18:41:02] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3001.esams.wmnet'] `  Of which those **FAILED**: ` ['ganeti3001.esams.wmnet'] `
[18:41:04] <Urbanecm>	 matthiasmullie: yup
[18:41:09] <matthiasmullie>	 (or if errors start spiking more)
[18:41:25] <matthiasmullie>	 cool, fingers crossed
[18:41:31] <wikibugs>	 (03PS4) 10Eevans: [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope)
[18:41:33] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3001.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241841_dzahn_4528...
[18:42:05] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:42:05] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:42:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH)
[18:42:08] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:42:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope)
[18:42:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:21] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH)
[18:42:23] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) 05Open→03Resolved All changes merged, when puppet runs on icinga it'll clear the alerts.
[18:42:25] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3002.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241842_dzahn_4545...
[18:42:44] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) a:05RobH→03None
[18:43:00] <wikibugs>	 (03CR) 10Mathew.onipe: wdqs: add data-reload cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe)
[18:43:15] <librenms-wmf>	 04Critical Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - Device rebooted
[18:43:24] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3003.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241843_dzahn_4563...
[18:43:26] <wikibugs>	 (03CR) 10Mathew.onipe: wdqs: Use a DRYer approach to check selected hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 (owner: 10Mathew.onipe)
[18:43:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH)
[18:43:37] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10RobH) 05Open→03Resolved
[18:43:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) a:05RobH→03None
[18:44:11] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10RobH) a:05RobH→03None
[18:44:12] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:44:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:25] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10RobH) a:05RobH→03None
[18:44:38] <icinga-wm>	 RECOVERY - ps1-b4-eqiad-infeed-load-tower-A-phase-Z on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-A-phase-Z 367 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:44:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: fix serial connection for ps1-a2-eqiad - https://phabricator.wikimedia.org/T235190 (10RobH) 05Open→03Resolved not sure how it was fixed, but it was fixed since we now setup the pdu.
[18:44:54] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10RobH)
[18:45:02] <icinga-wm>	 RECOVERY - ps1-b4-eqiad-infeed-load-tower-A-phase-Y on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-A-phase-Y 486 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:45:02] <wikibugs>	 (03PS1) 10Mobrovac: RESTRouter: s/klwm/kawm/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/545918
[18:45:06] <icinga-wm>	 RECOVERY - ps1-b4-eqiad-infeed-load-tower-B-phase-Y on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-B-phase-Y 489 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:45:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10RobH) a:05RobH→03None
[18:45:24] <icinga-wm>	 RECOVERY - ps1-b4-eqiad-infeed-load-tower-A-phase-X on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-A-phase-X 315 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:45:33] <wikibugs>	 (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: s/klwm/kawm/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/545918 (owner: 10Mobrovac)
[18:46:08] <urandom>	 !log restbase cassandra rolling restart, rack 'b' -- T200803
[18:46:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:13] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[18:46:18] <wikibugs>	 (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: Add ka.wm.org (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac)
[18:46:30] <wikibugs>	 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway)
[18:46:41] <wikibugs>	 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway) p:05Triage→03High
[18:47:25] <wikibugs>	 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway)
[18:49:14] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-b4-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted
[18:49:54] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10colewhite) This issue is mitigated as of this UTC morning and confirm I am no longer seeing long delays of list email.
[18:50:08] <icinga-wm>	 PROBLEM - Host 2620:0:862:1:b226:28ff:fe6e:cfe0 is DOWN: PING CRITICAL - Packet loss = 100%
[18:51:03] <bblack>	 hmmm
[18:51:16] <bblack>	 did we forget ipv6 for one of these? :)
[18:52:45] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] `  and were **ALL** successful.
[18:53:02] <matthiasmullie>	 Urbanecm: I'm still getting varying results based on what server serves the request
[18:53:06] <matthiasmullie>	 but the errors seem to have stopped
[18:53:17] <bblack>	 or it's a temporary bad icinga entry from icinga-ization preceding mapped-v6 or something
[18:53:21] <matthiasmullie>	 so they seem to have been unrelated or resolved
[18:53:31] <Urbanecm>	 matthiasmullie: indeed
[18:53:52] <Urbanecm>	 by "what server serves the request" do you mean debug vs. ordinary, or different results even among ordinary servers?
[18:53:59] <Urbanecm>	 and have you tried in an incognito window?
[18:54:29] <matthiasmullie>	 some ordinary servers serve the expected result, some don't
[18:54:49] <matthiasmullie>	 the code is there, though - I checked all
[18:54:57] <Urbanecm>	 interesting
[18:55:02] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10ssingh) >>! In T235983#5603254, @ssingh wrote: > This bug is also affecting the https://lists.wikimedia.org/mailman/listinfo/traffic-anoma...
[18:55:21] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:22] <Urbanecm>	 !log Morning SWAT done
[18:55:22] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:22] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:22] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:23] <bblack>	 yeah that IPv6 host-down is from dns3001, some kind of race between configuration of proper ipv6 and icinga-ization during initial install.  it will fix itself later I think
[18:55:24] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:55:24] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:55:24] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:55:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:28] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:29] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:31] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:55:56] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[18:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:23] <matthiasmullie>	 Urbanecm: I'll keep an eye on things, but we can probably leave the patch up
[18:56:24] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:28] <matthiasmullie>	 thanks!
[18:56:31] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:56:32] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[18:56:33] <Urbanecm>	 matthiasmullie: ack, and you're welcome
[18:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:29] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:57:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:28] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[18:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:30] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:00:31] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:00:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:25] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:01:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:31] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:01:33] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:01:33] <icinga-wm>	 PROBLEM - rsyslog in esams is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_wezen.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=esams+prometheus/ops
[19:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:14] <wikibugs>	 (03PS1) 10Dzahn: smokeping: replace bast3002 with bast3004 as target [puppet] - 10https://gerrit.wikimedia.org/r/545921 (https://phabricator.wikimedia.org/T236394)
[19:02:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope)
[19:03:21] <icinga-wm>	 PROBLEM - Recursive DNS on 91.198.174.61 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS
[19:03:30] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:37] <wikibugs>	 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) @fgiunchedi Is there already a ticket for setting up those new prometheus VMs?  I see it needs some related changes in puppet, so i'll block this ticket on that.
[19:05:24] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:05:26] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[19:05:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:38] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:59] <urandom>	 !log restbase cassandra rolling restart, rack 'd' -- T200803
[19:06:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:03] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[19:06:24] <wikibugs>	 (03PS1) 10Dzahn: hieradata/common: add bast3004 to bastion hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/545922 (https://phabricator.wikimedia.org/T236394)
[19:06:26] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Archiva relies on a tmpfs directory that is wiped after each reboot - https://phabricator.wikimedia.org/T214366 (10Nuria) 05Open→03Resolved
[19:07:41] <icinga-wm>	 PROBLEM - Host cp3052 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:41] <icinga-wm>	 PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:41] <icinga-wm>	 PROBLEM - Host cp3050 is DOWN: PING CRITICAL - Packet loss = 100%
[19:07:49] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:07:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:11] <wikibugs>	 (03PS6) 10Eevans: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope)
[19:08:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope)
[19:08:58] <icinga-wm>	 RECOVERY - Host cp3053 is UP: PING OK - Packet loss = 0%, RTA = 83.41 ms
[19:08:58] <icinga-wm>	 RECOVERY - Host cp3050 is UP: PING OK - Packet loss = 0%, RTA = 83.38 ms
[19:08:58] <icinga-wm>	 RECOVERY - Host cp3052 is UP: PING OK - Packet loss = 0%, RTA = 83.52 ms
[19:08:58] <icinga-wm>	 RECOVERY - ps1-b4-eqiad-infeed-load-tower-B-phase-X on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-B-phase-X 302 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:08:59] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (10Nuria) 05Open→03Resolved
[19:09:34] <icinga-wm>	 PROBLEM - Host ganeti3001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:09:52] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[19:09:54] <mutante>	 well.. those are all failures to set downtime by the script i suppose
[19:10:00] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[19:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:06] <mutante>	 (return code 99 further above)
[19:10:13] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3061 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:10:13] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp3050 is CRITICAL: connect to address 10.20.0.50 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:10:14] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3053 is CRITICAL: connect to address 10.20.0.53 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:10:14] <icinga-wm>	 PROBLEM - check_trafficserver_backend_config_status on cp3061 is CRITICAL: NRPE: Command check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:10:14] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3063 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:10:14] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3064 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:10:14] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp3065 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:10:14] <icinga-wm>	 PROBLEM - Check systemd state on cp3065 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:34] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:10:48] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp3050 is CRITICAL: connect to address 10.20.0.50 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[19:10:54] <mutante>	 ^ installation in progress but downtime should have been set automatically by wmf-reimage 
[19:11:36] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: connect to address 10.20.0.52 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:11:38] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance backend on cp3051 is CRITICAL: NRPE: Command check_traffic_manager_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:11:38] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp3050 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:11:38] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[19:11:38] <icinga-wm>	 PROBLEM - eventlogging Varnishkafka log producer on cp3054 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:11:38] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp3063 is CRITICAL: NRPE: Command check_trafficserver_exporter_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:11:40] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3064 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:11:40] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp3065 is CRITICAL: NRPE: Command check_traffic_server_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:11:40] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp3065 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:11:40] <icinga-wm>	 RECOVERY - Host ganeti3001 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms
[19:12:14] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.1 200 Ok - 29969 bytes in 0.427 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:16] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp3051 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:16] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3050 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:16] <icinga-wm>	 PROBLEM - Check systemd state on cp3051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:18] <icinga-wm>	 PROBLEM - check_trafficserver_backend_config_status on cp3063 is CRITICAL: NRPE: Command check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:18] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3063 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:18] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3065 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:18] <icinga-wm>	 PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3062 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:13:18] <bblack>	 hmmm downtimes seem unreliable, sorry for the noise
[19:13:42] <icinga-wm>	 PROBLEM - Check systemd state on cp3064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:42] <wikibugs>	 Operations, ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3001.esams.wmnet'] `  and were **ALL** successful.
[19:13:50] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[19:13:58] <icinga-wm>	 PROBLEM - Host cp3054 is DOWN: PING CRITICAL - Packet loss = 100%
[19:15:16] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3050 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:16] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance backend on cp3051 is CRITICAL: NRPE: Command check_traffic_server_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:16] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp3052 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:16] <icinga-wm>	 PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:16] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp3051 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:15:16] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance backend on cp3053 is CRITICAL: NRPE: Command check_traffic_manager_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:17] <icinga-wm>	 PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3061 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:18] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp3065 is CRITICAL: NRPE: Command check_trafficserver_exporter_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:15:39] <wikibugs>	 Operations, ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3003.esams.wmnet'] `  and were **ALL** successful.
[19:15:55] <wikibugs>	 Operations, ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3002.esams.wmnet'] `  and were **ALL** successful.
[19:17:24] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3051 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:17:24] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3052 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:17:24] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp3053 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:17:24] <icinga-wm>	 PROBLEM - Check systemd state on cp3053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:26] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3065 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:17:26] <icinga-wm>	 PROBLEM - check_trafficserver_backend_config_status on cp3065 is CRITICAL: NRPE: Command check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:17:26] <icinga-wm>	 PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3064 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:18:26] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp3053 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:18:26] <icinga-wm>	 RECOVERY - Check systemd state on cp3053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:46] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance backend on cp3053 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:19:00] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3053 is OK: HTTP OK: HTTP/1.0 200 OK - 19414 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:20:44] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3051 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:20:44] <icinga-wm>	 PROBLEM - check_trafficserver_backend_config_status on cp3051 is CRITICAL: NRPE: Command check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:20:46] <icinga-wm>	 PROBLEM - eventlogging Varnishkafka log producer on cp3062 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:20:46] <icinga-wm>	 PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3050 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:20:56] <icinga-wm>	 PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:16] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3063 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:21:18] <icinga-wm>	 RECOVERY - check_trafficserver_backend_config_status on cp3063 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:21:18] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3063 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:21:28] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp3063 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:21:56] <icinga-wm>	 RECOVERY - eventlogging Varnishkafka log producer on cp3062 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:22:26] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3061 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:22:28] <icinga-wm>	 RECOVERY - check_trafficserver_backend_config_status on cp3061 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:22:32] <icinga-wm>	 RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3062 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:22:54] <icinga-wm>	 PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3065 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:23:00] <thcipriani>	 jouncebot: now
[19:23:00] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 36 minute(s)
[19:23:04] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3052 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:23:04] <thcipriani>	 bblack: are you in the middle of something? I wanted to do a quick gerrit restart: would that interfere with what you're doing?
[19:23:26] <icinga-wm>	 RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3061 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:23:46] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3050 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:24:00] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp3050 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:24:30] <icinga-wm>	 RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3050 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:24:40] <mutante>	 bblack: ack, i had it too. (exit_code=99) when it tries to set downtimes
[19:24:42] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3050 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:24:44] <icinga-wm>	 RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:02] <icinga-wm>	 PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3052 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:25:04] <icinga-wm>	 PROBLEM - eventlogging Varnishkafka log producer on cp3064 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:25:08] <icinga-wm>	 RECOVERY - Check systemd state on cp3051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:25:08] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:25:20] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance backend on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:25:36] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3051 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:25:40] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3052 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:25:48] <icinga-wm>	 RECOVERY - check_trafficserver_backend_config_status on cp3051 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:25:48] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:02] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:02] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp3052 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:02] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:26:02] <icinga-wm>	 RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:26:20] <icinga-wm>	 RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3052 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:36] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 19440 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:26:37] <wikibugs>	 Operations, ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (Dzahn)
[19:26:58] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3052 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:27:02] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3065 is CRITICAL: connect to address 10.20.0.65 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:28:01] <wikibugs>	 Operations, ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (Dzahn) - OS installed - in puppet with spare role - set to "staged" in netbox  Are all previous boxes already done?  If so then it can be handed over to Filippo i think.
[19:28:04] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:28:06] <icinga-wm>	 RECOVERY - Check systemd state on cp3064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:28:12] <icinga-wm>	 RECOVERY - Recursive DNS on 91.198.174.61 is OK: DNS OK: 0.102 seconds response time. www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS
[19:28:20] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3051 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:28:28] <icinga-wm>	 RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3064 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:29:14] <icinga-wm>	 RECOVERY - eventlogging Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:29:16] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3064 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:30:07] <wikibugs>	 Operations, ops-esams, DNS, Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3001.wikimedia.org'] `  and were **ALL** successful.
[19:30:11] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[19:30:14] <icinga-wm>	 RECOVERY - rsyslog in esams is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=esams+prometheus/ops
[19:30:58] <wikibugs>	 (CR) BBlack: [C: +1] hieradata/common: add bast3004 to bastion hosts for firewalls [puppet] - https://gerrit.wikimedia.org/r/545922 (https://phabricator.wikimedia.org/T236394) (owner: Dzahn)
[19:31:02] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 667 days) https://wikitech.wikimedia.org/wiki/Logs
[19:31:19] <wikibugs>	 (CR) Dzahn: [C: +2] hieradata/common: add bast3004 to bastion hosts for firewalls [puppet] - https://gerrit.wikimedia.org/r/545922 (https://phabricator.wikimedia.org/T236394) (owner: Dzahn)
[19:31:25] <urandom>	 !log restbase cassandra rolling restart, codfw / rack 'b' -- T200803
[19:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:31] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[19:31:50] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3053 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:32:17] <bblack>	 !log reboot dns3001
[19:32:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:51] <wikibugs>	 (PS7) Eevans: Config changes for Echo kask migration [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: Catrope)
[19:33:10] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance backend on cp3065 is CRITICAL: NRPE: Command check_traffic_manager_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:34:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] Config changes for Echo kask migration [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: Catrope)
[19:34:18] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:35:14] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:35:38] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:35:38] <icinga-wm>	 RECOVERY - Check systemd state on cp3065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:35:40] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3065 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:35:56] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance backend on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:35:58] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:36:08] <icinga-wm>	 PROBLEM - Host 2620:0:862:1:b226:28ff:fe6e:cfe0 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:1:b226:28ff:fe6e:cfe0)
[19:36:10] <icinga-wm>	 RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3065 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:36:13] <wikibugs>	 (PS8) Eevans: Config changes for Echo kask migration [mediawiki-config] - https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: Catrope)
[19:36:16] <wikibugs>	 Operations, ops-esams, Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3061.esams.wmnet', 'cp3064.esams.wmnet', 'cp3065.esams.wmnet'] `  and were **A...
[19:36:18] <icinga-wm>	 RECOVERY - check_trafficserver_backend_config_status on cp3065 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:36:18] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:36:22] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3065 is OK: HTTP OK: HTTP/1.0 200 OK - 19417 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:36:30] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance backend on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:39:40] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3062 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:41:30] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:41:40] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:41:40] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3061 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:41:56] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti3003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:43:04] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs3005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:43:40] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3064 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:45:38] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3063 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:46:03] <mutante>	 ^ known .. silencing the PDU part of it
[19:46:49] <wikibugs>	 (PS2) CDanis: swift alerts: check over https when appropriate [puppet] - https://gerrit.wikimedia.org/r/545874
[19:47:03] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3050 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:22] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3050 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:22] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3051 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:22] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3052 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:22] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3053 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:22] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3061 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:22] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3062 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:22] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3063 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:23] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3064 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:23] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp3065 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:24] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:24] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti3003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:25] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on lvs3005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:47:25] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on lvs3007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:48:24] <mutante>	 one last spam, but it won't repeat now until the next change, and then we'll see those once we've got the second power supply
[19:49:22] <bblack>	 I'm still getting the cps through their bringup stuff, not expecting them to be alert-free yet
[19:50:03] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19051/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545874 (owner: 10CDanis)
[19:50:21] <icinga-wm>	 RECOVERY - Host cp3054 is UP: PING OK - Packet loss = 0%, RTA = 83.51 ms
[19:52:05] <mutante>	 bblack: do you want to see the alerts or should i downtime ?
[19:52:07] <icinga-wm>	 PROBLEM - traffic-pool service on cp3054 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:52:33] <icinga-wm>	 PROBLEM - eventlogging Varnishkafka log producer on cp3054 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:52:39] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[19:52:41] <icinga-wm>	 PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3054 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:52:41] <icinga-wm>	 PROBLEM - Ensure traffic_server is running for instance tls on cp3054 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:52:41] <icinga-wm>	 PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:52:43] <icinga-wm>	 PROBLEM - Webrequests Varnishkafka log producer on cp3054 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:52:49] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[19:52:53] <Volker_E>	 I need help with https://phabricator.wikimedia.org/T235677, as I (and the Design team) am blocked on it.
[19:53:11] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:53:11] <icinga-wm>	 PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:53:15] <icinga-wm>	 PROBLEM - HTTPS Unified ECDSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:53:25] <icinga-wm>	 PROBLEM - Ensure traffic_manager is running for instance tls on cp3054 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[19:53:29] <icinga-wm>	 PROBLEM - HTTPS Unified RSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[19:53:31] <icinga-wm>	 PROBLEM - statsv Varnishkafka log producer on cp3054 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[19:54:32] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on bast3004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[19:54:44] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[19:55:24] <mutante>	 scheduled 1h downtime for services on cp3054
[19:56:28] <icinga-wm>	 RECOVERY - traffic-pool service on cp3054 is OK: OK - traffic-pool is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[19:57:30] <bblack>	 !log cp3054 - trying racadm serveraction hardreset
[19:57:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:59:21] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn)
[19:59:22] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Dzahn)
[19:59:24] <wikibugs>	 (03CR) 10MarcoAurelio: "This step (uploading a patch against this repo) doesn't appear to be documented at https://wikitech.wikimedia.org/wiki/Add_a_wiki. We shou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac)
[19:59:27] <wikibugs>	 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Dzahn)
[20:00:17] <wikibugs>	 (03CR) 10BPirkle: [C: 03+1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope)
[20:01:14] <wikibugs>	 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn) 05Open→03Stalled These need to stay on role(spare) for now, per bblack "we haven't figured out the edge DC ganeti cluster configs yet"
[20:01:24] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:01:28] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp3054 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 599712 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS
[20:01:28] <icinga-wm>	 RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp3054 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 546072 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS
[20:01:28] <icinga-wm>	 RECOVERY - HTTPS Unified ECDSA on cp3054 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 599711 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS
[20:02:03] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790 (10Dzahn) 05Stalled→03Open
[20:02:05] <wikibugs>	 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Dzahn)
[20:02:40] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:02:45] <wikibugs>	 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Dzahn)
[20:02:46] <icinga-wm>	 RECOVERY - Webrequests Varnishkafka log producer on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[20:03:26] <icinga-wm>	 RECOVERY - Ensure traffic_manager is running for instance tls on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:03:36] <icinga-wm>	 RECOVERY - eventlogging Varnishkafka log producer on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[20:03:52] <icinga-wm>	 RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:03:52] <icinga-wm>	 RECOVERY - Ensure traffic_server is running for instance tls on cp3054 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:03:52] <icinga-wm>	 RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:04:02] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:04:05] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3053.esams.wmnet', 'cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3054.esams.wmnet'] `  and were **A...
[20:04:34] <bblack>	 !log reboot cp3054 again for good measure
[20:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:30] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish
[20:07:24] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on csw2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi known, only 1 power feed until tomorrow - The acknowledgement expires at: 2019-10-25 12:06:55. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[20:09:57] <icinga-wm>	 RECOVERY - statsv Varnishkafka log producer on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka
[20:11:59] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ^ made active bastion host with the global firewall change above  created wikitech pages  https://wikitech.wikimedia.org/wiki/Bast3004 https://wikitech.wikimedia.org/w...
[20:15:05] <wikibugs>	 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul)
[20:17:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] smokeping: replace bast3002 with bast3004 as target [puppet] - 10https://gerrit.wikimedia.org/r/545921 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn)
[20:19:10] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "LGTM." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans)
[20:21:23] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul)
[20:22:41] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3054.esams.wmnet
[20:22:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:52] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Papaul)
[20:23:00] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3053.esams.wmnet
[20:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:10] <wikibugs>	 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul)
[20:24:13] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3033.esams.wmnet
[20:24:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:31] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3038.esams.wmnet
[20:24:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:45] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) a:05Papaul→03Dzahn
[20:25:04] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) service "bastion host" is ready but service "tftp" still needs to be migrated. taking it.
[20:29:15] <wikibugs>	 (03PS1) 10CDanis: wikimedia.org: add performance-graphite [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870)
[20:30:45] <wikibugs>	 (03PS2) 10CDanis: wikimedia.org: add performance-graphite [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870)
[20:33:12] <wikibugs>	 (03CR) 10Bstorm: "For reference, the puppet compiler bit: https://puppet-compiler.wmflabs.org/compiler1001/19052/" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata)
[20:33:38] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata)
[20:33:50] <urandom>	 !log restbase cassandra rolling restart, codfw / rack 'c' -- T200803
[20:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:55] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[20:36:40] <bblack>	 !log esams lvs: high-traffic1 - change 3003's med to 200, 3001's med to 50, 3005 remains 100 (traffic will blip to 3005 then back to 3001 again)
[20:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:56] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10RobH) p:05Triage→03Normal
[20:40:07] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10RobH)
[20:40:45] <bblack>	 !log esams lvs: high-traffic1 - change 3005's med to 0 (becomes new primary, permanently)
[20:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:16] <wikibugs>	 10Operations, 10ops-eqiad: rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10RobH) a:03Joe IRC update: Chatted with @joe and I'm assigning this to him first so he can evaluate where he wants to have all of these incoming mw hosts racked.  He'll update this task, and pl...
[20:42:12] <wikibugs>	 (03PS1) 10Dzahn: Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935
[20:43:57] <wikibugs>	 (03PS2) 10Dzahn: Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935
[20:44:39] <wikibugs>	 (03PS1) 10BBlack: esams text lvs: 5 is primary, [13] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545936
[20:45:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "need something like this again for migrating another bastion. first restoring and then moving to profile and adjusting IPs" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (owner: 10Dzahn)
[20:45:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (owner: 10Dzahn)
[20:46:09] <wikibugs>	 (03PS3) 10Dzahn: Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394)
[20:46:11] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] esams text lvs: 5 is primary, [13] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545936 (owner: 10BBlack)
[20:48:03] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn)
[20:52:29] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata)
[20:58:10] <bblack>	 !log cr2-esams switch high-traffic1 static fallback routes from lvs3001 to lvs3005
[20:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:19] <bblack>	 !log cr3-esams switch high-traffic1 static fallback routes from lvs3001 to lvs3005
[20:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:39] <bblack>	 !log downtimed lvs3001-4, stopping pybal there, etc...
[21:02:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:35] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10ops-monitoring-bot)
[21:05:46] <urandom>	 !log restbase cassandra rolling restart, codfw / rack 'd' -- T200803
[21:05:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:53] <stashbot>	 T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803
[21:05:54] <bblack>	 !log cr2-esams remove pybal neighbor IPs for lvs3001-4
[21:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:17] <bblack>	 !log cr3-esams remove pybal neighbor IPs for lvs3001-4
[21:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:35] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10Bstorm) ` Enclosure Device ID: 32 Slot Number: 4 Enclosure position: 1 Device Id: 4 WWN: 55cd2e414dae9475 Sequence Number: 4 Media Error Count: 0 Other Error Count: 98287 Predictive Failure Count: 0 Last Pr...
[21:10:32] <wikibugs>	 (03PS4) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394)
[21:10:34] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10Bstorm)
[21:10:37] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (10Bstorm)
[21:12:02] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (10bd808)
[21:12:04] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10bd808)
[21:12:31] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3039.esams.wmnet
[21:12:34] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dns3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[21:12:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn)
[21:13:00] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3044.esams.wmnet
[21:13:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:11] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet
[21:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:01] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet
[21:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:12] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3040.esams.wmnet
[21:16:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:12] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp3054 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[21:20:20] <wikibugs>	 (03PS1) 10Subramanya Sastry: Direct Parsoid/PHP logs to a parsoid-php logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899)
[21:22:58] <wikibugs>	 (03PS1) 10BBlack: lvs3001-4: unconfigure and move to spare [puppet] - 10https://gerrit.wikimedia.org/r/545946
[21:24:42] <wikibugs>	 (03CR) 10Subramanya Sastry: "In a few months time, once Parsoid/PHP logging has been fine-tuned and we are well past the deployment, we can revert this change if we de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) (owner: 10Subramanya Sastry)
[21:27:39] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] lvs3001-4: unconfigure and move to spare [puppet] - 10https://gerrit.wikimedia.org/r/545946 (owner: 10BBlack)
[21:31:13] <wikibugs>	 (03PS5) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394)
[21:33:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn)
[21:44:44] <wikibugs>	 (03PS1) 10Dzahn: mediawiki::webserver: disable hhvm-needs-restart cron [puppet] - 10https://gerrit.wikimedia.org/r/545950 (https://phabricator.wikimedia.org/T229792)
[21:46:43] <wikibugs>	 10Operations, 10Traffic, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green), 10Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (10WDoranWMF) @BBlack @ema would you have any time to review Petr's patch above?
[21:49:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19053/mw1343.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/545950 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn)
[21:56:51] <wikibugs>	 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10Dzahn) merged the above because we were getting cron spam from appservers with "/usr/local/bin/hhvm-needs-restart: not found"
[21:59:53] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Volker_E) We've reverted Git LFS for now in https://github.com/wikimedi...
[22:00:23] <icinga-wm>	 RECOVERY - Check systemd state on mw1270 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:01:05] <mutante>	 !log mw1270 - was alerting in Icinga as degraded systemd state - reason was "hhvm.service not-found".  systemctl reset-failed cleared it. could cause monitoring spam on more servers (T229792)
[22:01:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:10] <stashbot>	 T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[22:09:02] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[22:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:39] <thcipriani>	 !log stopping gerrit briefly for script run for T236344
[22:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:44] <stashbot>	 T236344: Some All-Users.git references are outdated after gerrit1001 migration - https://phabricator.wikimedia.org/T236344
[22:11:07] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:11:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:12] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[22:12:13] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[22:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:15] <logmsgbot>	 !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[22:12:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:19] <MatmaRex>	 gerrit's down, is that known?
[22:12:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:05] <mutante>	 MatmaRex: i don't think so, no. logging in
[22:13:08] <mutante>	 thcipriani: ^
[22:13:16] <subbu>	 MatmaRex, looks like thcipriani stopped it above for running a script
[22:13:16] <MatmaRex>	 "Gerrit is down. We're working on bringing it back as soon as possible.
[22:13:17] <MatmaRex>	 Please follow along the discussion at #wikimedia-operations on freenode as we debug.
[22:13:17] <MatmaRex>	 Please try again later!"
[22:13:29] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.downtime
[22:13:30] <icinga-wm>	 PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[22:13:32] <icinga-wm>	 PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:50] <icinga-wm>	 PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:51] <mutante>	 !log gerrit1001 - starting gerrit
[22:13:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:16] <subbu>	 mutante, looks like th.cipriani stopped it for T236344?
[22:14:19] <thcipriani>	 mutante: MatmaRex I logged it
[22:14:19] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:32] <thcipriani>	 was just about to start it again
[22:14:34] <icinga-wm>	 PROBLEM - Check the last execution of git_pull_charts on contint1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:14:44] <icinga-wm>	 RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[22:15:03] <mutante>	 thcipriani: i hope starting it was ok ...
[22:15:04] <icinga-wm>	 RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:46] <icinga-wm>	 RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:50] <mutante>	 sorry, i also missed the log line somehow
[22:16:25] <thcipriani>	 mutante: should be fine. I was being paranoid. Preventing a very unlikely race condition. I ran the script right after stopping it, was checking some things when I noticed it was already started. I think it should be ok :)
[22:16:26] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:38] <mutante>	 thcipriani: phew, ok!
[22:18:24] <wikibugs>	 (03PS1) 10Bstorm: maintain-kubeusers: add ability to merge and update configs [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202)
[22:19:58] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10wiki_willy) a:05Bstorm→03Jclark-ctr Talked to our Dell rep on this one, who can reach out to the Dell tech support rep directly, after we re-open the ticket.  He basically confirmed the same thing @Bsto...
[22:20:07] <wikibugs>	 (03PS6) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394)
[22:20:43] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on dns3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams buildout WIP https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[22:21:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) Pointed this task out to our Dell account rep today.  @Jclark-ctr - let me know if the steps they provided don...
[22:24:45] <icinga-wm>	 RECOVERY - Check the last execution of git_pull_charts on contint1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:31:39] <wikibugs>	 (03PS7) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394)
[22:33:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[22:33:36] <wikibugs>	 (03PS4) 10Dzahn: Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[22:37:50] <wikibugs>	 (03CR) 10Dzahn: "@RLazarus: perfect example for a small change to Apache config that needs testing / deployment." [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[22:44:46] <wikibugs>	 (03PS8) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394)
[22:44:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19055/" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn)
[22:47:37] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10wiki_willy) a:03RobH @RobH - can you check if the configuration on this one is complete?  It was one of the PDUs you and Chris upgraded, when you went out to eqiad.  Thanks, Willy
[22:56:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10RobH) 05Open→03Resolved >>! In T227143#5604899, @wiki_willy wrote: > @RobH - can you check if the configuration on this one is complete?  It was one of the PDUs you and Chris upgraded, when you went o...
[22:56:42] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH)
[23:00:04] <jouncebot>	 MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:00:14] <wikibugs>	 (03PS1) 10Dzahn: bastionhost: fix migration class fqdn comparison, rename vars [puppet] - 10https://gerrit.wikimedia.org/r/545971 (https://phabricator.wikimedia.org/T236394)
[23:02:19] <wikibugs>	 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10mmodell) So I think this should really be deployed with scap. That woul...
[23:03:21] <wikibugs>	 (03PS1) 10Dzahn: DHCP: replace bast3002 with bast3004 as next-server [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394)
[23:04:29] <Volker_E>	 mutante: would you be able to help with https://phabricator.wikimedia.org/T235677 once more?
[23:05:01] <wikibugs>	 (03PS2) 10Dzahn: DHCP: replace bast3002 with bast3004 as next-server [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394)
[23:06:28] <mutante>	 Volker_E: kind of busy right now. i think the latest comment by Mukunda is the way to go though. 
[23:06:42] <mutante>	 did you need a quick fix or something or in general
[23:06:56] <Volker_E>	 very quick fix 
[23:06:59] <twentyafterfour>	 mutante: not sure where to find the puppet logs for Microsites
[23:07:00] <Volker_E>	 we're blocked
[23:07:08] <Volker_E>	 we=Design team
[23:07:26] <mutante>	 twentyafterfour: what do you want to know from puppet log?
[23:07:27] <twentyafterfour>	 mutante: I can catch up with you later about it though if you are busy 
[23:07:50] <mutante>	 the error was already on the ticket, wasnt it
[23:07:51] <twentyafterfour>	 mutante: well, previously the git::clone had errors and we've force-pushed an update to gerrit so that it won't use git-lfs
[23:07:59] <twentyafterfour>	 but the change didn't show up apparently
[23:08:25] <mutante>	 ok, running puppet to see
[23:08:41] <twentyafterfour>	 I'm going to write a patch to switch it to scap, but I'll need review, will ping you about that later if it's ok 
[23:09:04] <mutante>	 Error: '/usr/bin/git  pull --quiet' returned 128 instead of one of [0]
[23:09:50] <twentyafterfour>	 mutante: thanks, go back to what you were doing I'll try to help Volker_E 
[23:09:56] <twentyafterfour>	 :) 
[23:09:56] <mutante>	 (/Stage[main]/Profile::Microsites::Design/Git::Clone[design/style-guide]/Exec[git_pull_design/style-guide]/returns) fatal: refusing to merge unrelated histories
[23:09:59] <mutante>	 this says more
[23:10:08] <mutante>	 unrelated histories..
[23:10:16] <thcipriani>	 that's kind of what you'd expect if it was force-pushed
[23:10:16] <mutante>	 ok!
[23:10:49] <wikibugs>	 (03CR) 10Jon Harald Søby: "This might be too late, but our naming conventions for country-based community groups (and chapters) is to use countrycode.wikimedia.org. " [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[23:10:58] <thcipriani>	 that is, if your remote is completely different than your local copy and you pull quietly, git quietly refuses to act :)
[23:11:52] <wikibugs>	 (03PS2) 10Cwhite: admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209)
[23:12:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[23:12:24] <wikibugs>	 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10colewhite) p:05Triage→03Normal
[23:13:07] <twentyafterfour>	 thcipriani: right but shouldn't it be cloning fresh? 
[23:13:17] <thcipriani>	 IIRC this has something to do with the + in the refspec
[23:13:19] <wikibugs>	 10Operations, 10Wikimedia-production-error: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (10colewhite) p:05Triage→03Normal
[23:13:26] <twentyafterfour>	 ahh I see 
[23:13:35] <twentyafterfour>	 ensure => latest causes it to pull
[23:13:56] <wikibugs>	 10Operations, 10observability: Tune HTTP availability alerts - https://phabricator.wikimedia.org/T236367 (10colewhite) p:05Triage→03Normal
[23:14:12] <mutante>	 yes. we can delete everything and then run puppet
[23:14:24] <twentyafterfour>	 mutante: that is probably the easiest fix 
[23:14:34] <twentyafterfour>	 but they need to stop making changes that require a force-push ;) 
[23:14:38] <mutante>	 but that could mean pages get cached while there is no content
[23:14:50] <twentyafterfour>	 mutante: yeah
[23:14:54] <thcipriani>	 > Since Git version 2.20, fetching to update refs/tags/* works the same way as when pushing. I.e. any updates will be rejected without + in the refspec (or --force)
[23:14:56] <mutante>	 and i wont be able to start purging them right now
[23:15:17] <thcipriani>	 so adding a + to the refspec may fix puppet
[23:15:19] <twentyafterfour>	 thcipriani: yeah I think ensure=>latest should probably add --force 
[23:15:28] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite)
[23:15:43] <thcipriani>	 I don't know if that's the least surprising thing it could do
[23:15:44] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite) p:05Triage→03Normal a:03colewhite
[23:16:25] <thcipriani>	 the consequence of adding --force could be worse than a bad puppet run
[23:16:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] bastionhost: fix migration class fqdn comparison, rename vars [puppet] - 10https://gerrit.wikimedia.org/r/545971 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn)
[23:17:09] <twentyafterfour>	 thcipriani: well, wouldn't that grab whatever is on gerrit, which matches my expectation for what is "latest"
[23:17:30] <twentyafterfour>	 in general it seems like a bad idea to have puppet ensuring latest 
[23:17:36] <thcipriani>	 ^
[23:17:53] <mutante>	 that's why we usually have 2 repos, normal and deployment-repo
[23:17:58] <twentyafterfour>	 right
[23:18:17] <twentyafterfour>	 scap is for sure the solution but trying to quickly get things unblocked 
[23:18:30] <twentyafterfour>	 I'll write up the scap patch
[23:18:48] * Krinkle staging on mwdebug1001
[23:19:02] <twentyafterfour>	 ~(a puppet patch to switch it to deploy via scap) 
[23:19:16] <Krinkle>	 twentyafterfour: OK to use scap now for a MW fix?
[23:19:28] <twentyafterfour>	 Krinkle: yes  I think so
[23:20:27] <Krinkle>	 Tchanders: staged on mwdebug1001
[23:20:40] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite)
[23:21:15] <Tchanders>	 Krinkle Thanks, checking
[23:21:34] <wikibugs>	 (03PS1) 10Cwhite: admin: add cglenn to researchers and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321)
[23:22:41] <Tchanders>	 Krinkle All good
[23:23:30] <wikibugs>	 (03PS3) 10Cwhite: admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209)
[23:24:12] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.3/includes/specials/pagers/BlockListPager.php: T236425,  fc99c5a7c0de2 (duration: 00m 54s)
[23:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:18] <stashbot>	 T236425: Fatal Error: "Call to a member function getId() on string" from BlockListPager.php - https://phabricator.wikimedia.org/T236425
[23:25:45] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 04-1] "Should be ge.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[23:26:23] <wikibugs>	 (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite)
[23:27:40] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 04-1] "Should be gewikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[23:30:03] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite) Hi CherRaye!  I added the checklist and need a couple things in order to proceed: an SSH public key and sign off...
[23:30:49] <wikibugs>	 (03PS1) 10Dzahn: bastionhost::migration: comment out rsync server, duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545976
[23:31:36] <wikibugs>	 (03PS2) 10Dzahn: bastionhost::migration: comment out rsync server, duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545976
[23:31:41] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] bastionhost::migration: comment out rsync server, duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545976 (owner: 10Dzahn)
[23:33:09] <wikibugs>	 (03PS1) 10Dzahn: Revert "Set DNS configuration for ka.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/545977
[23:33:57] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 4:" [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[23:35:26] <wikibugs>	 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle)
[23:35:33] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 03+1] "Thanks. :-)" [dns] - 10https://gerrit.wikimedia.org/r/545977 (owner: 10Dzahn)
[23:36:13] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "Set DNS configuration for ka.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/545977 (owner: 10Dzahn)
[23:37:47] <icinga-wm>	 RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[23:39:32] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "ack.  it should use "ge" instead." [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[23:41:27] <wikibugs>	 (03CR) 10Dzahn: RESTRouter: Add ka.wm.org (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac)
[23:41:34] <wikibugs>	 (03CR) 10Dzahn: ""ka" was actually a wrong request. it should use "ge" instead." [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac)
[23:43:19] <wikibugs>	 (03PS1) 10Dzahn: add ge.wikimedia.org for Georgia chapter [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389)
[23:46:41] <mutante>	 !log bast3002 - rsyncing /home, /srv/tfptboot and /srv/prometheus to /srv/bast3002/ on bast3004 (T236394 T236329)
[23:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:46:51] <stashbot>	 T236329: decommission bast3002 - https://phabricator.wikimedia.org/T236329
[23:46:52] <stashbot>	 T236394: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394
[23:49:49] <wikibugs>	 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) >>! In T236329#5601691, @fgiunchedi wrote: > Also a note for when the time comes: there's Prometheus data on this host that will need to be migrated onto a VM on esams' ga...
[23:50:03] <wikibugs>	 (03CR) 10Krinkle: "Should this be under *.wikimedia.org? For JS/HTML security (CORS) might make sense to place elsewhere just in case. E.g. wmfusercontent.or" [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis)
[23:53:38] <wikibugs>	 (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/dns/+/545979" [dns] - 10https://gerrit.wikimedia.org/r/545977 (owner: 10Dzahn)
[23:54:20] <wikibugs>	 (03CR) 10Dzahn: "reverted and made https://gerrit.wikimedia.org/r/c/operations/dns/+/545979" [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio)
[23:56:27] <icinga-wm>	 PROBLEM - snapshot of s4 in eqiad on db1115 is CRITICAL: snapshot for s4 at eqiad taken more than 4 days ago: Most recent backup 2019-10-20 23:28:32 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[23:56:53] <wikibugs>	 (03CR) 10Jon Harald Søby: [C: 03+1] add ge.wikimedia.org for Georgia chapter [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) (owner: 10Dzahn)
[23:58:23] <wikibugs>	 (03PS1) 10Milimetric: Sync geoeditors data to dumps and add links [puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280)