[00:03:18] !log restarting gerrit to increase heap_size from 20G to 32G (T225166 T222391) [00:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:28] T225166: Gerrit crashed due to out of Heap - https://phabricator.wikimedia.org/T225166 [00:03:28] T222391: Gerrit Hardware Upgrade (+ upgrade from jessie to stretch or buster) - https://phabricator.wikimedia.org/T222391 [00:05:15] (03CR) 10Dzahn: "gerrit restarted" [puppet] - 10https://gerrit.wikimedia.org/r/545381 (https://phabricator.wikimedia.org/T225166) (owner: 10Dzahn) [00:28:46] (03CR) 10Dzahn: [C: 03+2] "cosmetic reasons, it's not like we receive mail here anymore" [puppet] - 10https://gerrit.wikimedia.org/r/544078 (https://phabricator.wikimedia.org/T180641) (owner: 10Dzahn) [00:29:17] (03PS2) 10Dzahn: exim: switch mail for RT to moscovium [puppet] - 10https://gerrit.wikimedia.org/r/544078 (https://phabricator.wikimedia.org/T180641) [00:36:31] (03PS2) 10Dzahn: ATS/varnish: replace director for RT with moscovium [puppet] - 10https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641) [00:43:40] (03PS3) 10Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - 10https://gerrit.wikimedia.org/r/522567 [00:45:11] RECOVERY - MariaDB Slave Lag: m3 on db2078 is OK: OK slave_sql_lag Replication lag: 0.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:45:57] RECOVERY - MariaDB Slave Lag: m3 on db2065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [00:51:09] (03PS1) 10Dzahn: cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - 10https://gerrit.wikimedia.org/r/545687 [00:53:18] (03CR) 10jerkins-bot: [V: 04-1] cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - 10https://gerrit.wikimedia.org/r/545687 (owner: 10Dzahn) [00:58:46] (03CR) 10Dzahn: "finally fixed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/522567 (owner: 10Dzahn) [00:59:01] (03PS2) 10Dzahn: cumin: fix some Python lint in wmf_auto_reimage_lib [puppet] - 10https://gerrit.wikimedia.org/r/545687 [01:16:35] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:16:41] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:18:11] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:18:17] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:02:11] (03PS1) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 [02:42:29] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:01:15] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20309576 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:02:51] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 52112 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:06:19] (03CR) 10BBlack: "The DHCP part only covers the first 9 hosts in Rack 15 (updated commitmsg to be more correct and explicit about it). So no dns3001 yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [03:06:53] (03PS3) 10BBlack: Basic install for new esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) [03:08:17] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:11:47] (03CR) 10BBlack: [C: 03+2] basic DNS entries for new esams hosts [dns] - 10https://gerrit.wikimedia.org/r/545662 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [03:19:06] (03PS4) 10BBlack: Basic install for new esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) [03:37:05] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:42:46] (03PS1) 10BBlack: cp30[56][0-9]: add hiera/conftool data [puppet] - 10https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242) [03:46:25] hmmm cp30[56].... that needs to be allowed on the acme-chief config [03:49:21] (03PS1) 10BBlack: lvs300[567]: add public1-esams addrs [dns] - 10https://gerrit.wikimedia.org/r/545692 (https://phabricator.wikimedia.org/T236294) [03:50:37] (03PS1) 10Vgutierrez: acme_chief: Grant new esams cp hosts access to the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/545693 (https://phabricator.wikimedia.org/T234803) [03:51:22] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Grant new esams cp hosts access to the unified certificate [puppet] - 10https://gerrit.wikimedia.org/r/545693 (https://phabricator.wikimedia.org/T234803) (owner: 10Vgutierrez) [03:55:17] PROBLEM - Check the Netbox report librenms for fail status. 
on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [03:55:38] (03PS1) 10BBlack: lvs300[567]: LVS puppetization [puppet] - 10https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294) [03:55:57] !log temporarily turn down accept delay on fermium - T235983 [03:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:03] T235983: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 [03:56:13] (03CR) 10BBlack: [C: 03+2] lvs300[567]: add public1-esams addrs [dns] - 10https://gerrit.wikimedia.org/r/545692 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [04:00:56] (03CR) 10BBlack: [C: 03+2] Basic install for new esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/545658 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [04:48:21] (03PS1) 10Marostegui: site.pp: Remove puppet references for db1070 [puppet] - 10https://gerrit.wikimedia.org/r/545698 (https://phabricator.wikimedia.org/T235464) [04:48:31] (03PS1) 10Marostegui: wmnet: Remove production DNS entries for db1070 [dns] - 10https://gerrit.wikimedia.org/r/545699 (https://phabricator.wikimedia.org/T235464) [04:48:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [04:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [04:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:51:48] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove puppet references for db1070 [puppet] - 10https://gerrit.wikimedia.org/r/545698 (https://phabricator.wikimedia.org/T235464) (owner: 10Marostegui) [04:52:22] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for db1070 [dns] - 10https://gerrit.wikimedia.org/r/545699 (https://phabricator.wikimedia.org/T235464) (owner: 
10Marostegui) [04:55:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1097:3315 after compression', diff saved to https://phabricator.wikimedia.org/P9460 and previous config saved to /var/cache/conftool/dbconfig/20191024-045544-marostegui.json [04:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1089 from special slaves group and leave it with its original pooling options T223151', diff saved to https://phabricator.wikimedia.org/P9461 and previous config saved to /var/cache/conftool/dbconfig/20191024-045924-marostegui.json [04:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:30] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [04:59:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1097:3315 after compression', diff saved to https://phabricator.wikimedia.org/P9462 and previous config saved to /var/cache/conftool/dbconfig/20191024-045954-marostegui.json [04:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:55] (03PS1) 10Marostegui: db2048,db2061: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/545700 (https://phabricator.wikimedia.org/T228258) [05:07:39] (03CR) 10Marostegui: [C: 03+2] db2048,db2061: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/545700 (https://phabricator.wikimedia.org/T228258) (owner: 10Marostegui) [05:18:34] !log Run analyze enwiki.revision on db2092 T223151 [05:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:39] T223151: Review special replica partitioning of certain tables by `xx_user` - https://phabricator.wikimedia.org/T223151 [05:20:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1097:3315 after compression', diff saved to 
https://phabricator.wikimedia.org/P9463 and previous config saved to /var/cache/conftool/dbconfig/20191024-052002-marostegui.json [05:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:48] (03PS1) 10Vgutierrez: install_server: Fix MAC addresses for new esams boxes [puppet] - 10https://gerrit.wikimedia.org/r/545701 (https://phabricator.wikimedia.org/T236294) [06:13:18] (03CR) 10Vgutierrez: [C: 03+2] install_server: Fix MAC addresses for new esams boxes [puppet] - 10https://gerrit.wikimedia.org/r/545701 (https://phabricator.wikimedia.org/T236294) (owner: 10Vgutierrez) [06:16:20] (03PS1) 10Giuseppe Lavagetto: parsoid: actually support safe restarts at deploy time. [puppet] - 10https://gerrit.wikimedia.org/r/545702 (https://phabricator.wikimedia.org/T236275) [06:18:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] parsoid: actually support safe restarts at deploy time. [puppet] - 10https://gerrit.wikimedia.org/r/545702 (https://phabricator.wikimedia.org/T236275) (owner: 10Giuseppe Lavagetto) [06:30:10] (03PS5) 10Giuseppe Lavagetto: discovery.yaml: add parsoid-php microservice [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [06:31:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] discovery.yaml: add parsoid-php microservice [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [06:32:42] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=parsoid-php,name=eqiad [06:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:40] (03CR) 10Giuseppe Lavagetto: "> "error: Name 'parsoid-php.discovery.wmnet.': resolver plugin" [dns] - 10https://gerrit.wikimedia.org/r/543737 (owner: 10Dzahn) [06:36:08] (03PS2) 10Giuseppe Lavagetto: add metafo record for parsoid-php [dns] - 10https://gerrit.wikimedia.org/r/543737 (owner: 10Dzahn) [06:38:25] (03CR) 10Giuseppe Lavagetto: [C: 
03+2] add metafo record for parsoid-php [dns] - 10https://gerrit.wikimedia.org/r/543737 (owner: 10Dzahn) [06:40:01] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [06:48:25] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM; one nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545640 (owner: 10Jbond) [06:52:51] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman [07:02:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (PCC also seems fine: https://puppet-compiler.wmflabs.org/compiler1002/19035/)" [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [07:07:07] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman [07:09:01] starting to work on mr1, will have to bring it down a couple times at least for upgrades [07:12:38] (03CR) 10Muehlenhoff: systemd: fixes in coredump class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli) [07:15:41] PROBLEM - Juniper alarms on cr3-esams is CRITICAL: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [07:16:13] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:17:51] PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100% [07:22:02] !log drain Telia link on cr2-esams [07:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:57] !log vgutierrez@cumin1001 START - 
Cookbook sre.hosts.downtime [07:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:34] (03CR) 10Muehlenhoff: "Confirmed; anyone who's staff in some way (contractor, reqnr, vendor) should use the @wikimedia.org in data.yaml, we're running some consi" [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [07:24:57] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [07:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:42] ok.... [07:29:59] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:30:05] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:30:11] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:31:17] 04Critical Alert for device cr3-esams.wikimedia.org - Juniper alarm active [07:38:00] (03PS1) 10Vgutierrez: hiera: Provide storage configuration for ats-backend on cp3055 [puppet] - 10https://gerrit.wikimedia.org/r/545706 (https://phabricator.wikimedia.org/T233242) [07:42:44] !log bump rsyslog- topics partitions to 6 and roll-restart logstash frontends [07:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:00] (03CR) 10Vgutierrez: [C: 03+2] hiera: Provide storage configuration for ats-backend on cp3055 [puppet] - 10https://gerrit.wikimedia.org/r/545706 (https://phabricator.wikimedia.org/T233242) (owner: 10Vgutierrez) [07:50:29] (03PS1) 10Vgutierrez: install_server: Fix dns3002 FQDN [puppet] - 10https://gerrit.wikimedia.org/r/545709 (https://phabricator.wikimedia.org/T236217) [07:50:57] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:51:21] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [07:51:46] (03CR) 10Vgutierrez: [C: 03+2] install_server: Fix dns3002 FQDN [puppet] - 10https://gerrit.wikimedia.org/r/545709 (https://phabricator.wikimedia.org/T236217) (owner: 10Vgutierrez) [07:53:15] ^^ that icinga error is expected/known? [07:55:49] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:55:55] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:56:07] (03CR) 10Volans: wmf_auto_reimage: Adjust message about waiting for puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522567 (owner: 10Dzahn) [07:57:12] (03CR) 10ArielGlenn: "There's something amiss: https://puppet-compiler.wmflabs.org/compiler1001/19036/labstore1006.wikimedia.org/change.labstore1006.wikimedia.o" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [07:57:15] !log reboot mr1-esams [07:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:18] the icinga error is "Could not find hostgroup matching 'asw2-esams.mgmt.esams.wmnet" [07:57:32] (03PS1) 10Vgutierrez: hiera: Provide ats storage config for new esams upload hosts [puppet] - 10https://gerrit.wikimedia.org/r/545711 (https://phabricator.wikimedia.org/T233242) [07:57:40] so related to the mr1 maintenance [07:57:41] mutante: ah, that's because it's the lldp neighbor for the new servers [07:57:47] moritzm: ^ [07:57:57] and it's not in icinga yet, I can add it [07:57:58] yep, I was just curious why icinga1001 alerted [07:58:35] no hurry, better complete the mr1 stuff first [07:59:07] (03CR) 10Ema: "The 
change itself looks good, but we first need to add profile::tlsproxy::envoy to role::requesttracker. I see that's commented out for no" [puppet] - https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[07:59:28] because puppet add icinga parent/child relationships based on lldp
[07:59:31] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (Gilles) It is indeed unusual for this to apply to specific pages of a small PDF, even moreso...
[08:00:09] (PS1) Vgutierrez: hiera: Provide varnish storage config for new cp text hosts [puppet] - https://gerrit.wikimedia.org/r/545712 (https://phabricator.wikimedia.org/T233242)
[08:02:26] (PS1) Ayounsi: Add asw2-esams to monitoring [puppet] - https://gerrit.wikimedia.org/r/545713
[08:02:45] moritzm: feel free to merge if you think it's ok - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545713
[08:07:03] (CR) Volans: [C: -1] "It seems that the same logic is shared by all wdqs cookbooks, so a DRYer approach would be to move it to a small function that accept the " (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/545673 (owner: Mathew.onipe)
[08:07:49] (PS2) Volans: fix unused format [cookbooks] - https://gerrit.wikimedia.org/r/545672 (owner: Mathew.onipe)
[08:08:28] XioNoX: currently in the middle of something, will have a look later
[08:09:20] (CR) Ema: [C: +1] hiera: Provide ats storage config for new esams upload hosts [puppet] - https://gerrit.wikimedia.org/r/545711 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:09:35] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/545687 (owner: Dzahn)
[08:13:05] (CR) Ema: [C: +1] hiera: Provide varnish storage config for new cp text hosts [puppet] -
https://gerrit.wikimedia.org/r/545712 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:14:05] (CR) Vgutierrez: [C: +2] hiera: Provide ats storage config for new esams upload hosts [puppet] - https://gerrit.wikimedia.org/r/545711 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:14:46] (CR) Vgutierrez: [C: +2] hiera: Provide varnish storage config for new cp text hosts [puppet] - https://gerrit.wikimedia.org/r/545712 (https://phabricator.wikimedia.org/T233242) (owner: Vgutierrez)
[08:15:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2092 for analyze table', diff saved to https://phabricator.wikimedia.org/P9465 and previous config saved to /var/cache/conftool/dbconfig/20191024-081519-marostegui.json
[08:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:32] (CR) Ayounsi: [C: +2] Add asw2-esams to monitoring [puppet] - https://gerrit.wikimedia.org/r/545713 (owner: Ayounsi)
[08:15:45] (CR) Ema: [C: +1] cp30[56][0-9]: add hiera/conftool data [puppet] - https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[08:16:05] Operations, DC-Ops, decommission: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (fgiunchedi) Also a note for when the time comes: there's Prometheus data on this host that will need to be migrated onto a VM on esams' ganeti cluster once that's online
[08:17:58] Operations, observability: Icinga last puppet run check: re-enable relaxed per-host check - https://phabricator.wikimedia.org/T236345 (Volans)
[08:18:08] Operations, observability: Icinga last puppet run check: re-enable relaxed per-host check - https://phabricator.wikimedia.org/T236345 (Volans) p:Triage→Normal
[08:18:20] (CR) Volans: [C: +2] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/545672 (owner: Mathew.onipe)
[08:18:31] !log roll restart rsyslog in
ulsfo/esams/eqsin to pick up new kafka partitions [08:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:55] (03Merged) 10jenkins-bot: fix unused format [cookbooks] - 10https://gerrit.wikimedia.org/r/545672 (owner: 10Mathew.onipe) [08:19:58] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['dns3002.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reima... [08:21:15] moritzm: I merged it, let me know if it solves the issue [08:21:17] 04̶C̶r̶i̶t̶i̶c̶a̶l Device mr1-esams.wikimedia.org recovered from Juniper alarm active [08:22:44] Icinga will tell us in a bit :-) [08:25:15] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Error: 429, Too Many Requests while trying to access other resolutions for a PDF file - https://phabricator.wikimedia.org/T236240 (10Gilles) It seems like the ghostscript command used by Thumbor outputs some errors to stdout... [08:25:55] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) [08:26:40] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) a:03Gilles [08:27:10] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (10Gilles) I will try looking at this in my spare time, but can't promis... 
[08:28:18] (CR) Vgutierrez: [C: +2] cp30[56][0-9]: add hiera/conftool data [puppet] - https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[08:28:34] (PS2) Vgutierrez: cp30[56][0-9]: add hiera/conftool data [puppet] - https://gerrit.wikimedia.org/r/545691 (https://phabricator.wikimedia.org/T233242) (owner: BBlack)
[08:28:37] (CR) Ema: [C: +1] lvs300[567]: LVS puppetization [puppet] - https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294) (owner: BBlack)
[08:33:56] Operations, Mail, Wikimedia-Mailing-lists: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 (colewhite) a:colewhite
[08:34:25] (PS1) Ema: puppetboard: add certificate [puppet] - https://gerrit.wikimedia.org/r/545716 (https://phabricator.wikimedia.org/T210411)
[08:35:20] Operations, Commons, MediaWiki-File-management, Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (Elitre) In the meantime, you have all my appreciation.
[08:36:38] !log roll restart rsyslog in codfw/eqiad to pick up new kafka partitions
[08:36:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:01] Operations, ops-esams, DNS, Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3002.wikimedia.org'] ` Of which those **FAILED**: ` ['dns3002.wikimedia.org'] `
[08:37:43] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[08:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:41] (PS1) Ema: puppetboard: add TLS termination with envoy [puppet] - https://gerrit.wikimedia.org/r/545717 (https://phabricator.wikimedia.org/T210411)
[08:39:44] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[08:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:52] (PS1) Ema: secret: dummy key for puppetboard [labs/private] - https://gerrit.wikimedia.org/r/545718 (https://phabricator.wikimedia.org/T210411)
[08:39:55] (PS1) Gehel: Maps: remove varnish URI sanitization for maps (now done in Kartotherian) [puppet] - https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817)
[08:39:57] (PS1) Ema: ATS: use TLS and DNS discovery to connect to puppetboard [puppet] - https://gerrit.wikimedia.org/r/545724 (https://phabricator.wikimedia.org/T210411)
[08:40:10] Operations, Discovery-Search, observability: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (fgiunchedi) Picking this up as part of {T235891}, and to answer your question @jbond the current way is via scap + puppet
[08:40:22] (PS2) Gehel: Maps: remove varnish URI sanitization for maps (now done in Kartotherian) [puppet] - https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817)
[08:40:43] PROBLEM
- Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:03] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:36] (03PS1) 10Ema: Add puppetboard.discovery.wmnet pointing to puppetboard1001 [dns] - 10https://gerrit.wikimedia.org/r/545733 (https://phabricator.wikimedia.org/T210411) [08:43:52] (03PS1) 10Vgutierrez: hiera: Add dns3002 to ntp_peers list [puppet] - 10https://gerrit.wikimedia.org/r/545744 (https://phabricator.wikimedia.org/T236217) [08:44:37] 10Operations, 10Discovery-Search, 10observability: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (10fgiunchedi) [08:44:44] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi) [08:44:53] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [08:44:57] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 93, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:45:33] RECOVERY - Juniper alarms on cr3-esams is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [08:45:45] RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms [08:45:47] (03PS1) 10Muehlenhoff: The new esams cache hosts are configured to use the new NMVE setup in partman.cfg, but we also need to configure them in the late-command.sh script which performs the actual setup. [puppet] - 10https://gerrit.wikimedia.org/r/545752 [08:46:08] (03CR) 10Vgutierrez: [C: 03+2] hiera: Add dns3002 to ntp_peers list [puppet] - 10https://gerrit.wikimedia.org/r/545744 (https://phabricator.wikimedia.org/T236217) (owner: 10Vgutierrez) [08:46:21] (03PS2) 10Vgutierrez: hiera: Add dns3002 to ntp_peers list [puppet] - 10https://gerrit.wikimedia.org/r/545744 (https://phabricator.wikimedia.org/T236217) [08:46:31] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [08:46:39] (03CR) 10jerkins-bot: [V: 04-1] The new esams cache hosts are configured to use the new NMVE setup in partman.cfg, but we also need to configure them in the late-command.sh script which performs the actual setup. 
[puppet] - https://gerrit.wikimedia.org/r/545752 (owner: Muehlenhoff)
[08:47:46] (PS2) Muehlenhoff: Fix NVME setup for new esams caches [puppet] - https://gerrit.wikimedia.org/r/545752
[08:53:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:53:57] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:55:03] (CR) Vgutierrez: [C: +1] Fix NVME setup for new esams caches [puppet] - https://gerrit.wikimedia.org/r/545752 (owner: Muehlenhoff)
[08:56:46] Operations, ops-esams, Traffic, Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['lvs3006.esams.wmnet'] ` The log can be found in `/var/log/wm...
[09:01:17] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr3-esams.wikimedia.org recovered from Juniper alarm active [09:06:21] (03PS3) 10Muehlenhoff: Fix NVME setup for new esams caches [puppet] - 10https://gerrit.wikimedia.org/r/545752 [09:06:30] (03PS4) 10Jbond: puppet: clean up unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 [09:07:13] (03PS1) 10Gilles: Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853) [09:07:48] (03CR) 10Muehlenhoff: [C: 03+2] Fix NVME setup for new esams caches [puppet] - 10https://gerrit.wikimedia.org/r/545752 (owner: 10Muehlenhoff) [09:08:17] (03PS2) 10Gilles: Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853) [09:09:11] (03CR) 10Jbond: [C: 03+2] puppet: clean up unused parameters [puppet] - 10https://gerrit.wikimedia.org/r/545640 (owner: 10Jbond) [09:09:23] (03CR) 10Gilles: [C: 03+2] Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853) (owner: 10Gilles) [09:10:15] (03Merged) 10jenkins-bot: Define performance survey in a more bulletproof way [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545779 (https://phabricator.wikimedia.org/T234853) (owner: 10Gilles) [09:12:15] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_wezen.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:12:16] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T234853 Re-enable performance perception survey on ruwiki (duration: 01m 04s) [09:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:20] T234853: Performance survey died on ruwiki on Sep 26 - 
https://phabricator.wikimedia.org/T234853 [09:12:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:12:28] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3055.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... [09:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:45] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_wezen.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [09:14:31] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:29] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [09:15:54] rsyslog expected btw ^ [09:15:59] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [09:22:58] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3006.esams.wmnet'] ` and were **ALL** successful. 
[09:23:41] (03PS1) 10Vgutierrez: hiera: Fix asw2-esams hostgroup name [puppet] - 10https://gerrit.wikimedia.org/r/545782 [09:24:54] (03CR) 10Vgutierrez: [C: 03+2] lvs300[567]: LVS puppetization [puppet] - 10https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [09:25:04] (03PS2) 10Vgutierrez: lvs300[567]: LVS puppetization [puppet] - 10https://gerrit.wikimedia.org/r/545696 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [09:26:12] (03CR) 10Ayounsi: [C: 03+1] "Hostname is good, dunno about the lldp parent/child logic and implications" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez) [09:32:09] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:32:48] (03CR) 10Arturo Borrero Gonzalez: "In general LGTM. Minor comments inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252) (owner: 10Alex Monk) [09:33:22] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:23] going to restart Jenkins to handle a plugin upgrade [09:38:30] (03PS1) 10Vgutierrez: lvs: Fix interface names for lvs300[5-7] and provide interface_tweaks [puppet] - 10https://gerrit.wikimedia.org/r/545785 (https://phabricator.wikimedia.org/T236294) [09:39:38] (03CR) 10Vgutierrez: [C: 03+2] lvs: Fix interface names for lvs300[5-7] and provide interface_tweaks [puppet] - 10https://gerrit.wikimedia.org/r/545785 (https://phabricator.wikimedia.org/T236294) (owner: 10Vgutierrez) [09:40:47] (03PS1) 10Filippo Giunchedi: aptrepo: add elastic 7 [puppet] - 
10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) [09:43:05] (03CR) 10Filippo Giunchedi: "Note that package names in this case have '-oss' appended, i.e. elasticsearch-oss so puppet will need adjustments too." [puppet] - 10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [09:43:13] (03PS2) 10Vgutierrez: hiera,netops: Fix asw2-esams hostname [puppet] - 10https://gerrit.wikimedia.org/r/545782 [09:43:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [09:45:10] (03CR) 10Muehlenhoff: [C: 03+1] "The elasticsearch puppet classes are already adapted for the -oss package names, but will likely need some fixups to cover 7.x as well." [puppet] - 10https://gerrit.wikimedia.org/r/545786 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [09:45:17] 10Operations, 10Discovery-Search, 10observability: Change logstash plugin deployment to use deb packaging and deployment - https://phabricator.wikimedia.org/T217340 (10fgiunchedi) With the upgrade to Elastic 7 as far as I can tell all logstash plugins we're shipping will be included already, in other words w... 
[09:47:02] (03PS4) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) [09:47:19] (03CR) 10Ayounsi: [C: 03+1] hiera,netops: Fix asw2-esams hostname [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez) [09:47:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:47:49] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:48:22] (03CR) 10Jbond: "moritzm, thanks for the review although i decided against using $settings::localcacert as such can you give another review, thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [09:48:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:48:49] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:48:59] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:49:08] (03CR) 10jerkins-bot: [V: 04-1] puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [09:49:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:49:25] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:49:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:49:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:50:46] uh... [09:50:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:51:01] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:51:22] mhh looks like a spike [09:52:01] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:52:11] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:52:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:52:37] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [09:53:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:53:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:53:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [09:57:32] !log Restarting CI Jenkins [09:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:19] (03PS5) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) [09:58:50] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3055.esams.wmnet'] ` and were **ALL** successful. 
[09:59:38] !log Converting CI jobs to use the new PostBuildScript plugin config | https://gerrit.wikimedia.org/r/#/c/integration/config/+/544907/ | T188398 [09:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:26] (03PS6) 10Jbond: puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) [10:10:56] matthiasmullie: SWAT is only in one hour, right? because I just saw you +2ed one of the backports… [10:11:01] or maybe I’m wrong, let’s check [10:11:03] jouncebot: now [10:11:03] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [10:11:49] Lucas_WMDE: yeah - getting a head start to get the test suites to complete in time [10:11:58] ok [10:12:12] there's 3 patches that depend on one another, so I'll merge them in sequentially [10:17:00] (03PS1) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798 [10:17:40] (03CR) 10Ema: [V: 03+2 C: 03+2] secret: dummy key for puppetboard [labs/private] - 10https://gerrit.wikimedia.org/r/545718 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:17:42] (03CR) 10Volans: [C: 03+2] "merging to unblock icinga for now, we'll re-evaluate it later" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez) [10:17:51] thx volans <3 [10:17:55] 10Operations, 10Traffic, 10observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10fgiunchedi) I've updated the frontend-traffic dashboard to include global availability correctly, and got rid of the summed value [10:18:01] (03CR) 10Ema: [C: 03+2] puppetboard: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545716 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:18:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Adjust s6 weights for db1093 and db1085', diff saved to 
https://phabricator.wikimedia.org/P9466 and previous config saved to /var/cache/conftool/dbconfig/20191024-101810-marostegui.json [10:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:21] (03PS2) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798 [10:20:23] (03CR) 10Ema: [C: 03+2] Add puppetboard.discovery.wmnet pointing to puppetboard1001 [dns] - 10https://gerrit.wikimedia.org/r/545733 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:20:36] (03PS3) 10Jbond: jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798 [10:21:01] (03CR) 10Alexandros Kosiaris: "ugh, its there no way to override that? Perhaps some doc/changelog from juniper documenting this change would give us a cleaner way out?" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez) [10:21:42] (03CR) 10Jbond: [C: 03+2] jenkins: use correct java version on buster [puppet] - 10https://gerrit.wikimedia.org/r/545798 (owner: 10Jbond) [10:22:51] (03CR) 10Ema: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/19040/" [puppet] - 10https://gerrit.wikimedia.org/r/545717 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:24:16] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) [10:24:24] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) a:05Gilles→03None [10:24:30] (03CR) 10Volans: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez) [10:25:02] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki config change not picked up after deployment - 
https://phabricator.wikimedia.org/T236366 (10Gilles) [10:25:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] blubberoid: Add TLS termination (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/544774 (https://phabricator.wikimedia.org/T210411) (owner: 10Giuseppe Lavagetto) [10:25:32] (03CR) 10Ema: [C: 03+2] ATS: use TLS and DNS discovery to connect to puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/545724 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [10:25:51] PROBLEM - Host lvs3006 is DOWN: PING CRITICAL - Packet loss = 100% [10:26:21] RECOVERY - Host lvs3006 is UP: PING OK - Packet loss = 0%, RTA = 83.44 ms [10:26:32] ^^ that was expected... [10:26:46] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) [10:27:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] scaffold: Add option for TLS termination [deployment-charts] - 10https://gerrit.wikimedia.org/r/543854 (https://phabricator.wikimedia.org/T236008) (owner: 10Giuseppe Lavagetto) [10:27:19] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) p:05Triage→03High [10:27:28] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) a:03Krinkle [10:27:59] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) Assigning this to you @krinkle so you can do more digging on the ResourceLoader side of things. 
[10:28:08] (03CR) 10Ayounsi: "that's the best doc so far - https://lists.gt.net/nsp/juniper/66466" [puppet] - 10https://gerrit.wikimedia.org/r/545782 (owner: 10Vgutierrez) [10:28:37] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) [10:28:48] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) [10:29:37] PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers maerlant.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:29:57] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [10:30:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I like this approach, albeit it has one minor caveat, that is assumes all extra ports will be of a debug nature. Which is possibly fine. L" [deployment-charts] - 10https://gerrit.wikimedia.org/r/544629 (owner: 10Giuseppe Lavagetto) [10:30:02] 10Operations, 10observability: Tune HTTP availability alerts - https://phabricator.wikimedia.org/T236367 (10fgiunchedi) [10:36:14] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201... 
[10:37:48] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10ema) [10:38:02] (03PS1) 10Filippo Giunchedi: monitoring: alert on reduced global http availability [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) [10:38:31] (03PS1) 10Jbond: jenkins: update java version for buster installs [puppet] - 10https://gerrit.wikimedia.org/r/545803 [10:39:12] (03CR) 10jerkins-bot: [V: 04-1] monitoring: alert on reduced global http availability [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) (owner: 10Filippo Giunchedi) [10:41:08] (03PS2) 10Jbond: jenkins: update java version for buster installs [puppet] - 10https://gerrit.wikimedia.org/r/545803 [10:41:25] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3057.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... 
[10:41:36] (03CR) 10Effie Mouzeli: systemd: fixes in coredump class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) (owner: 10Effie Mouzeli) [10:45:45] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 39, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:45:59] RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:47:15] (03PS2) 10Filippo Giunchedi: monitoring: alert on reduced global http availability [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) [10:47:40] jouncebot: next [10:47:40] In 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1100) [10:47:47] jouncebot: refresh [10:47:49] I refreshed my knowledge about deployments. 
[10:48:57] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 43, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:49:42] (03PS1) 10Vgutierrez: fix public interface name for lvs300[5-7] [dns] - 10https://gerrit.wikimedia.org/r/545805 (https://phabricator.wikimedia.org/T236294) [10:50:13] (03CR) 10Jbond: [C: 03+2] jenkins: update java version for buster installs [puppet] - 10https://gerrit.wikimedia.org/r/545803 (owner: 10Jbond) [10:50:16] (03CR) 10Vgutierrez: [C: 03+2] fix public interface name for lvs300[5-7] [dns] - 10https://gerrit.wikimedia.org/r/545805 (https://phabricator.wikimedia.org/T236294) (owner: 10Vgutierrez) [10:52:27] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3060.esams.wmnet'] ` [10:52:59] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [10:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:00] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:53] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:55:55] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:04] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:56:05] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:17] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:56:18] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:26] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:56:27] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:32] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:56:32] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:47] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:56:47] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:54] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [10:56:55] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. 
Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1100). [11:00:05] matthiasmullie and hauskater: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:13] o/ [11:00:14] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:21] I can SWAT today! [11:00:29] o/ [11:01:08] matthiasmullie: are you deploying your own changes? [11:01:12] that's ok - I can do them this time :) [11:01:15] ok :) [11:01:16] yeah, I'll do them [11:01:36] in that case, matthiasmullie please ping me once you're done [11:02:05] Urbanecm: will do! [11:02:10] thanks! [11:02:58] (03PS1) 10Muehlenhoff: Document database setup [software/debmonitor] - 10https://gerrit.wikimedia.org/r/545811 [11:04:01] (03PS1) 10Jbond: jenkins: also update alternatives for java [puppet] - 10https://gerrit.wikimedia.org/r/545812 [11:07:08] (03PS1) 10Ema: labweb: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545813 (https://phabricator.wikimedia.org/T210411) [11:07:10] (03PS1) 10Ema: cp3060: add to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545814 (https://phabricator.wikimedia.org/T233242) [11:08:04] (03CR) 10Ema: [C: 03+2] labweb: add certificate [puppet] - 10https://gerrit.wikimedia.org/r/545813 (https://phabricator.wikimedia.org/T210411) (owner: 10Ema) [11:08:17] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:08:17] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:08:19] (03CR) 10Ema: [C: 03+2] cp3060: add to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545814 (https://phabricator.wikimedia.org/T233242) (owner: 10Ema) [11:08:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:24] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:08:24] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:47] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10Jclark-ctr) Starting Pdu Replacement [11:09:19] (03PS1) 10Urbanecm: Add CAT as alias for NS_CATEGORY at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545815 (https://phabricator.wikimedia.org/T236352) [11:09:33] (03CR) 10Jbond: [C: 03+2] jenkins: also update alternatives for java [puppet] - 10https://gerrit.wikimedia.org/r/545812 (owner: 10Jbond) [11:09:37] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['cp3059.esams.wmnet'] ` The log can be found in `... [11:09:42] starting pdu refresh b4-eqiad https://phabricator.wikimedia.org/T227540 [11:09:58] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo... 
[11:10:09] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3060.esams.wmnet'] ` [11:10:20] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/lo... [11:13:19] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/Wikibase: Allow defining entity-type-specific PrefetchingTermLookup (duration: 01m 06s) [11:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:45] (03PS1) 10Urbanecm: Permission changes of move-rootuserpages assignment at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359) [11:14:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:14:31] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:14:47] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:15:43] !log mlitn@deploy1001 Synchronized php-1.35.0-wmf.3/extensions/WikibaseMediaInfo: Also use custom PrefetchingTermLookup in SingleEntitySourceServices (duration: 01m 01s) [11:15:47] (03PS2) 10Urbanecm: Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359) [11:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:16:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:16:09] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:16:35] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3057.esams.wmnet'] ` and were **ALL** successful. [11:16:45] Urbanecm: done [11:16:59] matthiasmullie: thanks! 
[11:17:13] (03CR) 10Urbanecm: [C: 03+2] Restrict uploads on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545655 (https://phabricator.wikimedia.org/T236307) (owner: 10MarcoAurelio) [11:17:15] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={cp2006:9536,cp2019:9536} site=codfw tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:17:19] hauskater: +2'ed your patch [11:17:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:17:25] ack [11:17:41] cp3060 noise is expected [11:17:57] (03Merged) 10jenkins-bot: Restrict uploads on azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545655 (https://phabricator.wikimedia.org/T236307) (owner: 10MarcoAurelio) [11:18:00] (03PS1) 10Ayounsi: New esams knams links + tilaa oob interface rename [dns] - 10https://gerrit.wikimedia.org/r/545817 [11:18:23] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=cp1081:9536 site=eqiad tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:19:05] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:19:33] hauskater: pulled at mwdebug1001, could you test please? 
[11:19:39] sure, doing [11:19:43] (03CR) 10Urbanecm: [C: 03+2] Add CAT as alias for NS_CATEGORY at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545815 (https://phabricator.wikimedia.org/T236352) (owner: 10Urbanecm) [11:20:24] (03Merged) 10jenkins-bot: Add CAT as alias for NS_CATEGORY at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545815 (https://phabricator.wikimedia.org/T236352) (owner: 10Urbanecm) [11:20:29] Urbanecm: lgtm, listgrouprights and sitebar upload link correctly configured [11:20:32] (03CR) 10Ayounsi: [C: 03+2] New esams knams links + tilaa oob interface rename [dns] - 10https://gerrit.wikimedia.org/r/545817 (owner: 10Ayounsi) [11:20:35] *sidebar [11:20:38] hauskater: thanks, deploying [11:21:25] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:22:20] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: SWAT: 2d66deb: Restrict uploads on azwiki (T236307) (duration: 01m 03s) [11:22:24] hauskater: synced [11:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:31] T236307: "Fayl yüklə" (Upload file) in AzWiki's Tools' Bar to Upload Wizard in Commons - https://phabricator.wikimedia.org/T236307 [11:22:34] perfect [11:22:36] thanks [11:22:40] happy to help! 
[11:23:54] (03PS3) 10Urbanecm: Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359) [11:24:00] (03CR) 10Urbanecm: [C: 03+2] Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359) (owner: 10Urbanecm) [11:24:47] (03Merged) 10jenkins-bot: Permission changes of move-rootuserpages assignment at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545816 (https://phabricator.wikimedia.org/T236359) (owner: 10Urbanecm) [11:26:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: e079956: Add CAT as alias for NS_CATEGORY at commonswiki (T236352) (duration: 01m 00s) [11:26:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:12] T236352: Create CAT: redirect for Category: namespace on Commons - https://phabricator.wikimedia.org/T236352 [11:26:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:24] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:09] (03PS5) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) [11:30:55] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3060_v4, cp3060_v6 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:30:59] !log Run mwscript namespaceDupes.php --wiki=commonswiki --add-prefix=FIXME --fix (T236352) [11:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:13] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [11:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:27] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:31:49] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:33:16] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:33:17] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:30] (03PS2) 10Cmjohnson: Adding mgmt dns for dumpsdata1003 [dns] - 10https://gerrit.wikimedia.org/r/545268 (https://phabricator.wikimedia.org/T234076) [11:33:49] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:34:05] (03PS1) 10Vgutierrez: hiera: Add cp305[5,7,9] to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545819 (https://phabricator.wikimedia.org/T233242) [11:34:36] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for dumpsdata1003 [dns] - 10https://gerrit.wikimedia.org/r/545268 (https://phabricator.wikimedia.org/T234076) (owner: 10Cmjohnson) [11:35:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 3a5cb68: Permission changes of move-rootuserpages assignment at commonswiki (T236359) (duration: 01m 00s) [11:35:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:16] T236359: Remove the "move-rootuserpages" right from all user groups without autopatrol flag on Commons - https://phabricator.wikimedia.org/T236359 [11:35:54] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Vgutierrez) [11:36:12] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Vgutierrez) [11:36:45] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 46 connecting: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:36:49] (03CR) 10Alexandros Kosiaris: "Bacula is fine, the LVM module on the other hand is not ours (there are roughly 2 patches for it from us). I am not sure if it's best to p" [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond) [11:36:56] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Vgutierrez) [11:37:33] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:37:39] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:37:49] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:37:59] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:38:11] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:38:11] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:38:15] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 48 ESP OK 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:38:19] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [11:38:37] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:40:06] !log EU SWAT done [11:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:38] (03PS6) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) [11:42:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] hhvm: remove hhvm leftovers from apache configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:45:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [11:45:37] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3059.esams.wmnet'] ` and were **ALL** successful. 
[11:47:28] (03PS1) 10BBlack: esams cp nodes: size storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242) [11:47:46] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:47:48] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:00] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:00] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:04] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:10] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:20] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:28] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:28] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:32] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:38] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:48:40] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:49:14] (03CR) 10Vgutierrez: [C: 03+2] hiera: Add cp305[5,7,9] to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545819 (https://phabricator.wikimedia.org/T233242) (owner: 10Vgutierrez) [11:49:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] puppetmnasters: use localcacert setting for CA file in apache [puppet] - 10https://gerrit.wikimedia.org/r/545575 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [11:49:56] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Gilles) By the looks of the survey dashboard, the issue might have gone away, but i... [11:49:59] (03CR) 10jerkins-bot: [V: 04-1] esams cp nodes: size storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242) (owner: 10BBlack) [11:50:20] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [11:50:20] PROBLEM - HTTPS Unified ECDSA on cp3060 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:51:27] (03PS2) 10BBlack: esams cp nodes: size storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242) [11:51:41] (03CR) 10Alexandros Kosiaris: [C: 03+1] puppet: manage localcacert in puppet [puppet] - 10https://gerrit.wikimedia.org/r/545573 (https://phabricator.wikimedia.org/T236277) (owner: 10Jbond) [11:52:18] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, 
cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [11:52:20] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [11:52:20] PROBLEM - HTTPS Unified RSA on cp3060 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [11:52:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:53:10] ^ looking [11:53:15] all the varnish/ipsec stuff is "expected" while bringing up new nodes [11:54:20] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [11:54:20] PROBLEM - eventlogging Varnishkafka log producer on cp3060 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:55:21] (03PS1) 10Jbond: puppet_compiler: install ruby-multi-json on buster for PUP-8715 [puppet] - 10https://gerrit.wikimedia.org/r/545821 [11:55:41] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10Cmjohnson) [11:56:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:56:20] (03CR) 10Jbond: [C: 03+2] puppet_compiler: install ruby-multi-json on buster for PUP-8715 [puppet] - 10https://gerrit.wikimedia.org/r/545821 (owner: 10Jbond) [11:56:24] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [11:58:24] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:58:24] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [11:58:26] PROBLEM - statsv Varnishkafka log producer on cp3060 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1200) [12:00:12] PROBLEM - Host phab1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:00:24] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:00:24] PROBLEM - Varnish HTTP text-frontend - port 80 on cp3060 is CRITICAL: connect to address 10.20.0.60 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [12:00:28] !log shutdown transit BGP sessions on cr2-knams [12:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:00:50] PROBLEM - Disk space on cp3060 is CRITICAL: DISK CRITICAL - /proc/sys/fs/binfmt_misc is not accessible: No such device https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3060&var-datasource=esams+prometheus/ops [12:01:24] PROBLEM - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:01:36] (03PS1) 10Cmjohnson: Adding production dns for dumpsdata1003 [dns] - 10https://gerrit.wikimedia.org/r/545823 (https://phabricator.wikimedia.org/T234076) [12:01:44] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance={cp1077:9536,cp1079:9536,cp1081:9536,cp1083:9536,cp1085:9536,cp1087:9536,cp1089:9536} site=eqiad tunnel={cp3060_v4,cp3060_v6} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:02:20] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:02:24] PROBLEM - Ensure traffic_manager is running for instance tls on cp3060 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:02:35] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for dumpsdata1003 [dns] - 10https://gerrit.wikimedia.org/r/545823 (https://phabricator.wikimedia.org/T234076) (owner: 10Cmjohnson) [12:02:36] ok the mediawiki errors [12:02:44] are for pl.wikibooks.org [12:02:59] "Maximum execution time of 180 seconds exceeded" [12:03:15] on the Scribunto extension [12:03:20] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10Cmjohnson) [12:03:30] RECOVERY - Host phab1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [12:04:24] PROBLEM - Ensure traffic_server is running for instance tls on cp3060 is CRITICAL: 
NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:06:21] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:06:21] !log shutdown cr1-esams - cr2-knams link [12:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [12:07:44] (03PS3) 10BBlack: esams cp nodes: size storage correctly [puppet] - 10https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242) [12:08:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2092 after analyze table', diff saved to https://phabricator.wikimedia.org/P9468 and previous config saved to /var/cache/conftool/dbconfig/20191024-120812-marostegui.json [12:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:17] jynus: ^ [12:08:19] 10Operations, 10Puppet, 10puppet-compiler: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) p:05Triage→03Low [12:08:20] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 46 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:09:10] alright cr2-knams will go offline very soon the time I move the fibers and go replace the device [12:09:52] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10Cmjohnson) a:05Cmjohnson→03ArielGlenn @ArielGlenn the onsite work has been completed, I did add the production dns [12:10:24] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp3060_v4, cp3060_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:10:44] (03CR) 10BBlack: [C: 03+2] esams cp nodes: size storage correctly 
[puppet] - 10https://gerrit.wikimedia.org/r/545820 (https://phabricator.wikimedia.org/T233242) (owner: 10BBlack) [12:10:58] PROBLEM - Juniper alarms on csw2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [12:11:26] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100% [12:11:42] (03PS1) 10Jbond: idp: add secrets [labs/private] - 10https://gerrit.wikimedia.org/r/545826 [12:12:40] PROBLEM - Host cp3034 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:40] PROBLEM - Host cp3035 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:40] PROBLEM - Host cp3036 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:40] PROBLEM - Host cp3040 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:40] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:41] PROBLEM - Host cp3039 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:41] PROBLEM - Host cp3041 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:42] PROBLEM - Host cp3044 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:42] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:43] PROBLEM - Host cp3043 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:43] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:43] (03PS2) 10Jbond: idp: add secrets [labs/private] - 10https://gerrit.wikimedia.org/r/545826 [12:12:44] PROBLEM - Host cp3046 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:44] PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:45] PROBLEM - Host cp3047 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:49] XioNoX: ^ ? 
[12:12:52] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [12:12:52] PROBLEM - Host cp3033 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:52] PROBLEM - Host cp3032 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:52] PROBLEM - Host cp3030 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:52] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:52] PROBLEM - Host lvs3002 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:53] PROBLEM - Host multatuli is DOWN: PING CRITICAL - Packet loss = 100% [12:12:54] wut? [12:12:58] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [12:12:58] PROBLEM - Host bast3002 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:58] PROBLEM - Host lvs3003 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:00] PROBLEM - Host lvs3004 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:21] prepping depool patch [12:13:27] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [12:13:33] bblack: thanks [12:13:45] ticket.wikimedia.org is not reachable because of connection timed out [12:13:45] (03PS1) 10BBlack: depool esams [dns] - 10https://gerrit.wikimedia.org/r/545827 [12:13:51] godog: those text-lb alerts don't have #page but they page [12:13:53] FYI [12:13:55] doctaxon: we are on it [12:14:01] thx [12:14:06] PROBLEM - Juniper virtual chassis ports on asw2-esams.mgmt.esams.wmnet is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [12:14:11] Hmm, wikipedia down for me, just stuck loading [12:14:15] working on it [12:14:17] bblack: go ahead with the depool, cannot reach wikis [12:14:18] asw-2 issues? 
[12:14:20] paladox: on it [12:14:30] Thanks [12:14:31] PROBLEM - LVS HTTP IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: connect to address 2620:0:862:ed1a::2:b and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:14:40] !log depool esams in geodns [12:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] PROBLEM - LVS HTTP IPv4 #page on upload-lb.esams.wikimedia.org is CRITICAL: connect to address 91.198.174.208 and port 80: No route to host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:15:00] it is back up for me [12:15:08] looking [12:15:11] not for me yet [12:15:14] (03CR) 10BBlack: [V: 03+2 C: 03+2] depool esams [dns] - 10https://gerrit.wikimedia.org/r/545827 (owner: 10BBlack) [12:15:20] PROBLEM - LVS HTTPS IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:15:20] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:15:22] RECOVERY - Host cp3042 is UP: PING WARNING - Packet loss = 28%, RTA = 83.44 ms [12:15:22] RECOVERY - Host cp3032 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms [12:15:22] RECOVERY - Host cp3047 is UP: PING WARNING - Packet loss = 28%, RTA = 83.39 ms [12:15:22] RECOVERY - Host bast3002 is UP: PING WARNING - Packet loss = 28%, RTA = 83.47 ms [12:15:22] RECOVERY - Host cp3036 is UP: PING WARNING - Packet loss = 28%, RTA = 83.41 ms [12:15:24] RECOVERY - Host cp3045 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms [12:15:24] RECOVERY - Host lvs3002 is UP: PING WARNING - Packet loss = 28%, RTA = 83.52 ms [12:15:24] RECOVERY - Host cp3030 is UP: PING WARNING - Packet loss = 28%, RTA = 83.41 ms [12:15:24] RECOVERY - Host cp3040 is UP: PING WARNING - Packet loss = 28%, RTA = 83.42 ms [12:15:24] RECOVERY 
- Host cp3049 is UP: PING WARNING - Packet loss = 28%, RTA = 83.40 ms [12:15:24] RECOVERY - Host cp3039 is UP: PING WARNING - Packet loss = 28%, RTA = 83.39 ms [12:15:25] RECOVERY - Host cp3043 is UP: PING WARNING - Packet loss = 28%, RTA = 83.45 ms [12:15:25] RECOVERY - Host cp3035 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms [12:15:26] back for me too now, [12:15:26] RECOVERY - Host cp3034 is UP: PING WARNING - Packet loss = 28%, RTA = 83.45 ms [12:15:26] RECOVERY - Host cp3044 is UP: PING WARNING - Packet loss = 28%, RTA = 83.39 ms [12:15:27] RECOVERY - Host cp3041 is UP: PING WARNING - Packet loss = 28%, RTA = 83.43 ms [12:15:27] RECOVERY - Host lvs3003 is UP: PING WARNING - Packet loss = 37%, RTA = 83.50 ms [12:15:28] RECOVERY - Host cp3046 is UP: PING WARNING - Packet loss = 44%, RTA = 83.42 ms [12:15:28] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 44%, RTA = 83.40 ms [12:15:29] RECOVERY - Host maerlant is UP: PING WARNING - Packet loss = 44%, RTA = 83.48 ms [12:15:29] RECOVERY - Host lvs3004 is UP: PING WARNING - Packet loss = 44%, RTA = 83.49 ms [12:15:30] RECOVERY - Host cp3033 is UP: PING WARNING - Packet loss = 44%, RTA = 83.42 ms [12:15:30] RECOVERY - Host cp3038 is UP: PING WARNING - Packet loss = 44%, RTA = 83.36 ms [12:15:31] RECOVERY - Host nescio is UP: PING WARNING - Packet loss = 28%, RTA = 83.48 ms [12:15:31] RECOVERY - Host multatuli is UP: PING OK - Packet loss = 0%, RTA = 83.48 ms [12:15:33] back too now [12:15:34] PROBLEM - PyBal backends health check on lvs3006 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_80: Servers cp3044.esams.wmnet, cp3039.esams.wmnet, cp3049.esams.wmnet, cp3047.esams.wmnet, cp3038.esams.wmnet are marked down but pooled: dns_rec_53: Servers maerlant.wikimedia.org are marked down but pooled: uploadlb6_80: Servers cp3034.esams.wmnet, cp3039.esams.wmnet, cp3045.esams.wmnet, cp3047.esams.wmnet, cp3038.esams.wmnet are [12:15:34] pooled: uploadlb_443: Servers cp3036.esams.wmnet, 
cp3039.esams.wmnet, cp3034.esams.wmnet, cp3035.esams.wmnet, cp3045.esams.wmnet are marked down but pooled: uploadlb6_443: Servers cp3034.esams.wmnet, cp3044.esams.wmnet, cp3035.esams.wmnet are marked down but pooled: dns_rec6_53_udp: Servers maerlant.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech [12:15:34] ki/PyBal [12:15:35] did a router just reboot? [12:15:39] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 83.35 ms [12:15:44] I think asw2 [12:15:45] authdns-update couldn't even reach multatuli to deploy the depool [12:15:45] we'll know soon [12:15:51] but proceeding with depool until we understand [12:15:52] RECOVERY - Juniper virtual chassis ports on asw2-esams.mgmt.esams.wmnet is OK: OK: UP: 6 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [12:15:52] it alerted right before [12:15:58] there it goes [12:16:15] I was up on esams, apparently, not after depool [12:16:16] all authdns have the depool now [12:16:16] maybe PSUs or something ? 
[12:16:17] RECOVERY - LVS HTTP IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 432 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:16:21] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.36 ms [12:16:24] <_joe_> hey one can't upgrade his irc bouncer and you all reboot a router :D [12:16:36] RECOVERY - LVS HTTP IPv4 #page on upload-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 419 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:16:51] and at last here come the recoveries (my pages are a couple minutes behind actual) [12:16:56] I have failed over dc on dns now [12:17:02] RECOVERY - LVS HTTPS IPv6 #page on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 870 bytes in 0.345 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:17:12] RECOVERY - OSPF status on mr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:17:18] Works here now! 
[12:17:22] RECOVERY - PyBal backends health check on lvs3006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:17:58] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:18:04] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:18:06] System booted: 2019-10-24 12:11:50 UTC (00:06:02 ago) [12:18:48] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 58.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:18:51] yeah, wrong power cable got unplugged seems like :( [12:19:36] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:19:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:20:28] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 72.85 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:23:40] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 49.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:25:20] !log cp3060: powercycle -- NMI watchdog: BUG: soft lockup - CPU#18 stuck for 22s! [charon:1226] T233242 [12:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:26] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [12:26:04] (03CR) 10Effie Mouzeli: hhvm: remove hhvm leftovers from apache configs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [12:26:37] 10Operations, 10procurement: 2x (5) C19-to-C20 power cables, 1.8m, red/blue - https://phabricator.wikimedia.org/T236377 (10mark) [12:26:48] volans: thanks! 
I'll look into it [12:28:02] RECOVERY - Disk space on cp3060 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3060&var-datasource=esams+prometheus/ops [12:28:20] RECOVERY - HTTPS Unified ECDSA on cp3060 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 540502 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:28:26] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:26] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:26] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:26] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:26] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:26] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:26] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 54 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:27] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:27] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:28] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:28] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan 
[12:28:29] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:29] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:30] RECOVERY - IPsec on cp2012 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:30] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:31] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:31] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 48 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:28:36] RECOVERY - HTTPS Unified RSA on cp3060 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 573244 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:29:00] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:29:15] godog: so of the 5 critical and 5 recovery, 3 had the page hashtag, 2 didn't, the two being the v4 and v6 alerts for the text-lb [12:29:44] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 539 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:29:53] volans: yeah I think it is the host alerts that don't have it, as opposed to lvs service alerts [12:30:10] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [12:30:21] got it [12:30:42] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 539 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:31:06] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 539 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:31:34] RECOVERY - Ensure traffic_manager is running for instance tls on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:31:36] RECOVERY - statsv Varnishkafka log producer on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:31:45] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` and were **ALL** successful. 
[12:32:00] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:32:10] RECOVERY - Ensure traffic_server is running for instance tls on cp3060 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:32:14] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3060 is OK: HTTP OK: HTTP/1.1 200 OK - 538 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:32:44] RECOVERY - eventlogging Varnishkafka log producer on cp3060 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [12:33:00] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 19427 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:33:08] !log Stopping puppet on all hosts including the hhvm class (C:hhvm) - 544864 - T229792 [12:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:13] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [12:33:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: add secrets [labs/private] - 10https://gerrit.wikimedia.org/r/545826 (owner: 10Jbond) [12:42:24] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: name=cp3060.esams.wmnet,service=varnish-be [12:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:48] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:46:54] 10Operations, 10procurement: 2x (5) C19-to-C20 power cables, 1.8m, red/blue - https://phabricator.wikimedia.org/T236377 (10mark) The package just shipped and will hopefully arrive tomorrow. I created IM ticket SCTASK0120754 for shipment notification. [12:47:06] volans: I thought it was just a tweak but it isn't, filed as T236379 [12:47:06] T236379: Include #page on host alerts that page SRE - https://phabricator.wikimedia.org/T236379 [12:47:28] ack, thx [12:47:29] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10Jclark-ctr) a:05Jclark-ctr→03RobH @RobH Corrected cable issue on pdu [12:47:48] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10Cmjohnson) [12:48:02] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 6.475 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:48:17] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10Jclark-ctr) Finished pdu refresh [12:55:34] (03CR) 10Volans: [C: 03+1] "LGTM, just one suggestion as mentioned in the other CR" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [12:56:55] !log purge hhvm hhvm-luasandbox hhvm-tidy hhvm-wikidiff2 hhvm-dbg from mw* canaries - T229792 [12:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:01] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [12:59:52] PROBLEM - Check 
correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [13:00:04] liw and brennen: Dear deployers, time to do the Mediawiki train - European Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1300). [13:00:49] ^ godog, what is that ? [13:01:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Ok, I had that backwards. LGTM then" [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:02:44] effie: the icinga configuration errors? icinga's unhappy for some reason [13:03:09] hmm [13:03:12] (03PS1) 10Lars Wirzenius: all wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545835 [13:03:15] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545835 (owner: 10Lars Wirzenius) [13:04:59] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545835 (owner: 10Lars Wirzenius) [13:06:38] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.3 [13:06:38] 10Operations, 10MediaWiki-REST-API, 10Parsoid-PHP, 10Traffic, and 3 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (10mobrovac) [13:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] (03CR) 10Filippo Giunchedi: [C: 03+1] puppet-merge: add Repository class [puppet] - 10https://gerrit.wikimedia.org/r/544943 (owner: 10Jbond) [13:08:44] effie: icinga config broken [13:08:52] Error: Could not find any contact matching 'hpham-email' (config file '/etc/icinga/objects/contactgroups.cfg', starting on line 77) [13:10:35] phamhi: ^ re: your contact change [13:17:08] (03CR) 10MSantos: [C: 03+1] Maps: remove varnish URI sanitization for 
maps (now done in Kartotherian) [puppet] - 10https://gerrit.wikimedia.org/r/545723 (https://phabricator.wikimedia.org/T232817) (owner: 10Gehel) [13:17:10] (03CR) 10Andrew Bogott: [C: 03+1] hhvm: remove hhvm leftovers from apache configs [puppet] - 10https://gerrit.wikimedia.org/r/545652 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:17:50] !log set ats-be weights on new esams upload nodes T233242 [13:17:50] (03CR) 10Andrew Bogott: [C: 03+1] hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:55] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [13:18:20] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% [13:18:23] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3053.esams.wmnet [13:18:24] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3055.esams.wmnet [13:18:25] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3065.esams.wmnet [13:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:26] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3057.esams.wmnet [13:18:27] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3061.esams.wmnet [13:18:28] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3059.esams.wmnet [13:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:29] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=ats-be,name=cp3051.esams.wmnet [13:18:30] !log ema@puppetmaster1001 
conftool action : set/weight=100; selector: service=ats-be,name=cp3063.esams.wmnet [13:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:14] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:20:35] (03CR) 10Gehel: [C: 03+1] query_service: separate categories from main blazegraph profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539285 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:20:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:21:14] (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:22:03] !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp30[56].*,cluster=cache_upload,service=nginx [13:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:23] !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp30[56].*,cluster=cache_upload,service=varnish-fe [13:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:43] !log bblack@cumin1001 conftool action : 
set/weight=1; selector: name=cp30[56].*,cluster=cache_text,service=nginx [13:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:02] !log bblack@cumin1001 conftool action : set/weight=1; selector: name=cp30[56].*,cluster=cache_text,service=varnish-fe [13:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:26] !log bblack@cumin1001 conftool action : set/weight=100; selector: name=cp30[56].*,service=varnish-be [13:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:31] !log ema@puppetmaster1001 conftool action : set/weight=100; selector: service=varnish-be,name=cp30[56].* [13:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:49] (03CR) 10Gehel: [C: 04-1] "A few previous comments are still unaddressed" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/537138 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:25:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Aside from the minor whitespace issue, +1 from me. And it's fine to keep the puppet structure as is." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:25:46] !log enable transit4/6 on cr2-knams [13:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:08] (03PS7) 10Effie Mouzeli: hhvm: make all files and packages absent by default [puppet] - 10https://gerrit.wikimedia.org/r/544864 (https://phabricator.wikimedia.org/T229792) [13:27:04] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/539513 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:27:43] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/539998 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [13:28:30] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:30:53] (03PS1) 10Giuseppe Lavagetto: conftool::scripts: add a helper script to initialize a node [puppet] - 10https://gerrit.wikimedia.org/r/545838 [13:31:20] <_joe_> bblack, ema ^^ [13:31:28] <_joe_> this would allow you to define a [13:31:35] (03PS1) 10Filippo Giunchedi: nagios: rename s/hpham/phamhi/ [puppet] - 10https://gerrit.wikimedia.org/r/545839 [13:31:39] <_joe_> initialize-varnish-be script for instance [13:31:49] <_joe_> that you can just run via cumin like "pool" [13:32:20] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10Gehel) Note that APIFeatureUsage has the ELK cluster talk to the Cirrus elasticsearch cluster directly. This means that logstash version on ELK needs to be compatible with the elasticsearc... 
[13:32:57] (03CR) 10Phamhi: [C: 03+2] nagios: rename s/hpham/phamhi/ [puppet] - 10https://gerrit.wikimedia.org/r/545839 (owner: 10Filippo Giunchedi) [13:34:36] !log enable puppet on mwdebug* [13:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:59] (03PS1) 10BBlack: Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/545845 [13:36:29] (03CR) 10BBlack: [C: 03+2] Revert "depool esams" [dns] - 10https://gerrit.wikimedia.org/r/545845 (owner: 10BBlack) [13:36:40] !log re-pooling esams in dns [13:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:14] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [13:46:00] _joe_: thanks :) [13:46:10] <_joe_> ema: I was thinking [13:46:15] <_joe_> you might prefer a single command [13:46:22] <_joe_> that goes through all the objects [13:46:42] <_joe_> if that's the case, we can rework the patch [13:47:15] <_joe_> ease-of-use vs granularity [13:48:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [13:49:22] (03PS1) 10Ema: cp3056: add to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545852 (https://phabricator.wikimedia.org/T233242) [13:50:16] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/lo... 
[13:50:33] (03CR) 10Ema: [C: 03+2] cp3056: add to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545852 (https://phabricator.wikimedia.org/T233242) (owner: 10Ema) [13:52:36] (03PS1) 10Effie Mouzeli: hhvm: fixes in removal [puppet] - 10https://gerrit.wikimedia.org/r/545854 (https://phabricator.wikimedia.org/T229792) [13:55:06] (03CR) 10CDanis: [C: 03+1] "LGTM++" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) (owner: 10Filippo Giunchedi) [13:55:10] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:56:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:59:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] hhvm: fixes in removal [puppet] - 10https://gerrit.wikimedia.org/r/545854 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [13:59:44] !log pool cp3060 T233242 [13:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:50] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [14:04:40] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 33.12 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:05:19] ^ this is normal, traffic in eqiad went down due to esams repool [14:06:00] (03CR) 10Effie Mouzeli: [C: 03+2] hhvm: fixes in removal [puppet] - 
10https://gerrit.wikimedia.org/r/545854 (https://phabricator.wikimedia.org/T229792) (owner: 10Effie Mouzeli) [14:07:30] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:07:56] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 48 connecting: cp3056_v4, cp3056_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [14:08:18] (03PS1) 10Ema: Add new esams cp hosts to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545857 (https://phabricator.wikimedia.org/T233242) [14:08:56] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 47, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:08:56] PROBLEM - BFD status on cr2-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:10:32] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:10:36] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:12:01] (03CR) 10BBlack: [C: 03+1] Add new esams cp hosts to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545857 (https://phabricator.wikimedia.org/T233242) (owner: 10Ema) [14:14:02] (03CR) 10Ema: [C: 03+2] Add new esams cp hosts to cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/545857 (https://phabricator.wikimedia.org/T233242) (owner: 10Ema) [14:15:44] (03CR) 10CDanis: "Haven't yet looked deeply at the code, but here's a few high-level questions and TODOs" (034 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (owner: 10RLazarus) [14:16:31] !log power-cycle cp3056, stuck rebooting into d-i T233242 [14:16:36] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [14:19:11] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/lo... [14:19:57] 10Operations, 10Core Platform Team, 10MediaWiki-ResourceLoader, 10Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Krinkle) [14:20:13] 10Operations, 10Core Platform Team, 10Performance-Team: MediaWiki production config change not/randomly picked up by startup module after deployment - https://phabricator.wikimedia.org/T236366 (10Krinkle) [14:20:48] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 78.33 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:21:17] (03CR) 10CDanis: metamonitoring: add sync of Icinga contacts (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [14:21:23] 10Operations, 10ops-eqiad: (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet - https://phabricator.wikimedia.org/T234076 (10ArielGlenn) Awesome. Do you need instructions for the raid setup or is that already taken care of? 
[14:22:36] !log enable puppet on mw app canaries [14:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:23] !log lvs3006 (upload, inactive) - manual pybal med s/100/90/ (preferred to lvs3004 for fallback from lvs3002) [14:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:11] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) @BBlack here is the information for the CP servers in rack 16 cp3061 : xe-6/0/15 cp3062: xe-6/0/16 cp3063: xe-6/0/17 cp3064: xe-6/0/18 cp3065: xe-6/0/19 [14:24:43] !lvs lvs3002 (upload, active) - manual pybal med s/0/50/ (restart will briefly flip traffic to lvs3006, will come back as primary again, but with non-zero med). [14:26:29] !log lvs3006 (upload, becoming active) - manual pybal med s/90/0/ (will take over from lvs3002, intended permanently). [14:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:42] rlazarus: I saw your review passing by, re: local style for new repos I'd recommend going the 'black' route, i.e. https://phabricator.wikimedia.org/T211750#4851410 [14:30:08] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) @BBlack lvs3007 switch information xe-6/0/12 [14:31:42] (03PS1) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) [14:33:24] (03PS1) 10BBlack: esams upload lvs: 6 is primary, [24] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545863 [14:35:05] (03CR) 10Matthias Mullie: "This change is ready for review." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek) [14:38:01] (03PS2) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) [14:38:32] (03CR) 10BBlack: [C: 03+2] esams upload lvs: 6 is primary, [24] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545863 (owner: 10BBlack) [14:38:58] (03CR) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek) [14:39:52] (03PS1) 10Filippo Giunchedi: Introduce Elastic 7 support [puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) [14:40:02] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [14:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:09] !log Remove hhvm hhvm-luasandbox hhvm-tidy hhvm-wikidiff2 hhvm-dbg from all canaries and codfw - T229792 [14:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [14:42:01] !log ema@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [14:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:12] (03CR) 10Matthias Mullie: "This change is ready for review." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek) [14:44:37] jouncebot now [14:44:37] No deployments scheduled for the next 1 hour(s) and 15 minute(s) [14:44:44] jouncebot: next [14:44:45] In 1 hour(s) and 15 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1600) [14:45:24] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Papaul) ganeti3003 switch information xe-6/0/13 [14:46:33] (03PS3) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) [14:47:07] (03CR) 10Matthias Mullie: [C: 03+1] "Not that I know too much about this, but LGTM! :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek) [14:47:43] !log run puppet on all canaries and codfw - T229792 [14:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:52] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [14:49:17] (03CR) 10Filippo Giunchedi: "Should be enough to get the ball rolling!" 
[puppet] - 10https://gerrit.wikimedia.org/r/545867 (https://phabricator.wikimedia.org/T234854) (owner: 10Filippo Giunchedi) [14:50:06] (03CR) 10Addshore: [C: 03+2] Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek) [14:50:59] (03Merged) 10jenkins-bot: Enable Wikibase client access on testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek) [14:51:18] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) [14:51:39] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-fgiunchedi: Ingest production logs with ELK7 - https://phabricator.wikimedia.org/T235891 (10fgiunchedi) [14:54:53] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3058.esams.wmnet'] ` and were **ALL** successful. 
[14:58:05] !log cr2-esams - change fallback static route for high-traffic2 to lvs3006 [14:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:34] !log cr3-esams - change fallback static route for high-traffic2 to lvs3006 [14:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:37] !log cr2-esams - add missing lvs3005 IP to bgp pybal neighbor list [15:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:56] (03PS2) 10CDanis: lvs: do not page on karthoterian unavailability [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto) [15:04:02] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: testcommonswiki, Enable Wikibase client access T223792 (duration: 00m 53s) [15:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:07] T223792: Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items - https://phabricator.wikimedia.org/T223792 [15:06:32] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (10fgiunchedi) >>! In T234854#5602802, @Gehel wrote: > Note that APIFeatureUsage has the ELK cluster talk to the Cirrus elasticsearch cluster directly. This means that logstash version on ELK... 
[15:07:23] (03CR) 10CDanis: [C: 03+2] lvs: do not page on karthoterian unavailability [puppet] - 10https://gerrit.wikimedia.org/r/545285 (owner: 10Giuseppe Lavagetto) [15:09:37] !log pool cp3055 (cache_upload) T233242 [15:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:42] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [15:09:53] !log Remove hhvm packages and enable puppet across the fleet - T229792 [15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:01] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [15:13:31] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) [15:14:53] 10Operations, 10ops-esams: rack/setup/install ps[12]-oe1[456]-esams - https://phabricator.wikimedia.org/T184066 (10RobH) I have remotely setup both ps1-oe15-esams and ps1-oe16-esams with network configuration. They have NOT had their ports labeled, as this task doesn't list what is plugged into each port. At... 
[15:15:54] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:16:53] (03CR) 10Filippo Giunchedi: [C: 03+2] "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545802 (https://phabricator.wikimedia.org/T236367) (owner: 10Filippo Giunchedi) [15:18:22] !log Slowly reload apache across the fleet (as we are enabling puppet) - T229792 [15:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:27] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [15:19:12] !log pool cp3058 (cache_text) T233242 [15:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:16] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [15:21:55] (03PS1) 10Mholloway: Revert "lvs::monitor_services: increase number of tries before MCS is critical" [puppet] - 10https://gerrit.wikimedia.org/r/545873 (https://phabricator.wikimedia.org/T229286) [15:22:17] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH) p:05Triage→03Normal [15:22:32] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:22:33] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10RobH) [15:22:43] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10jijiki) I will take a look tomorrow, sorry for delaying this [15:22:58] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar): Messages in Logstash from 
php-fatal-error.php are missing from type:mediawiki/channel:fatal - https://phabricator.wikimedia.org/T234283 (10jijiki) a:03jijiki [15:24:41] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being recieved from mailing lists - https://phabricator.wikimedia.org/T235983 (10Pine) p:05High→03Unbreak! Does this bug affect emergency@ or legal@? In any case if this is delaying emails to oversighters then I think that UBN prio... [15:25:14] (03PS1) 10CDanis: swift alerts: check over https when appropriate [puppet] - 10https://gerrit.wikimedia.org/r/545874 [15:25:19] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10Aklapper) [15:26:40] godog: thanks! will look [15:27:07] !log asw2-esams: configure port descriptions and vlan/lvs groupings for all rack16 hosts (lvs3007, ganeti3003, bast3004, cp3061-5) [15:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:10] (03CR) 10CDanis: "https://puppet-compiler.wmflabs.org/compiler1002/19048/" [puppet] - 10https://gerrit.wikimedia.org/r/545874 (owner: 10CDanis) [15:29:07] (03PS1) 10Phamhi: icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942) [15:29:08] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10ssingh) This bug is also affecting the https://lists.wikimedia.org/mailman/listinfo/traffic-anomaly-report list though it is not a priorit... 
[15:30:24] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 42.07 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:30:58] (03PS2) 10Phamhi: icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942) [15:33:42] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942) (owner: 10Phamhi) [15:33:56] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 41.64 le 60 Ema Brief traffic spike on eqsin varnish-fe, nothing to worry about https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:34:03] (03CR) 10Phamhi: [C: 03+2] icinga: fix permissions for SRE @ WMCS: Hieu Pham [puppet] - 10https://gerrit.wikimedia.org/r/545875 (https://phabricator.wikimedia.org/T228942) (owner: 10Phamhi) [15:34:18] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 82.89 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:35:40] (03PS1) 10BBlack: esams: mgmt dns for rack 16 [dns] - 10https://gerrit.wikimedia.org/r/545880 (https://phabricator.wikimedia.org/T236294) [15:37:39] (03CR) 10BBlack: [C: 03+2] esams: mgmt dns for rack 16 [dns] - 10https://gerrit.wikimedia.org/r/545880 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [15:40:10] RECOVERY - OSPF status on cr2-knams is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:40:24] (03CR) 10Filippo Giunchedi: "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545874 (owner: 10CDanis) [15:40:46] !log depool cp3030 (cache_text) T233242 [15:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:59] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [15:42:20] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [15:45:42] !log depool cp3034 (cache_upload) T233242 [15:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:28] PROBLEM - Check systemd state on mw1270 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:21] !log depool cp3032 (cache_text) T233242 [15:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:25] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [15:52:41] 10Operations, 10Wikimedia-production-error: mwdebug1001 throws "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" - https://phabricator.wikimedia.org/T236401 (10Urbanecm) [15:57:31] 10Operations, 10Phabricator, 10hardware-requests, 10serviceops, 10Release-Engineering-Team (Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (10mark) I'm a bit confused; as far as I know the... [15:58:53] (03CR) 10Alexandros Kosiaris: "> Note the difference: you are using quotation marks in your examples, while there are none in the script." 
[puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac) [15:59:23] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [15:59:37] (03PS3) 10Alexandros Kosiaris: helmfile_log_sal: Fix getting the user and host for logging [puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac) [16:00:05] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1600). [16:00:05] urandom: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Sorry for taking so long to merge this. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac) [16:00:34] ooo, a sticker. tempting. [16:01:02] (03CR) 10RLazarus: "One quick reply, more to come shortly." 
(031 comment) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (owner: 10RLazarus) [16:02:40] 10Operations, 10Wikimedia-General-or-Unknown, 10serviceops, 10Performance-Team (Radar): Investigate recurrent GET latency spikes on MediaWiki appservers (Oct 16) - https://phabricator.wikimedia.org/T235872 (10Krinkle) [16:03:07] 10Operations, 10serviceops, 10Performance-Team (Radar): Increased POST latency for MW app servers (Oct 2019) - https://phabricator.wikimedia.org/T235755 (10Krinkle) [16:07:05] !log pool cp3057 (cache_upload) T233242 [16:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:10] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [16:14:21] (03PS2) 10Ottomata: Include hadoop client packages and config on dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) [16:15:44] (03CR) 10Ottomata: "Seems good!" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [16:17:36] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10MarcoAurelio) [16:19:32] (03CR) 10EBernhardson: [C: 03+2] Sort debian/sha256sums explicitely [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/543187 (owner: 10DCausse) [16:20:13] urandom: were you expecting this patch to be swatted? 
[16:20:22] I still can't tell if anyone actually uses puppet swat [16:20:38] cdanis: yeah [16:20:54] I mean, I was trying to avoid the process of pestering people [16:21:27] but if swat comes and goes, I'll just revert to pestering, and weaponize this to justify why I am :) [16:21:34] ahah [16:22:33] (03PS1) 10MarcoAurelio: (WIP) Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) [16:22:35] urandom: I'm fairly unfamiliar with Cassandra and our use of it, but you aren't, so if you expect this to be non-impactful, and to not require any special handholding (like disabling Puppet on some fraction of the fleet and then slowly re-enabling), I'm happy to +2 and merge [16:23:36] cdanis: most of this is non-normative, changes to comments and the like, meant to manage the signal-to-noise when comparing changes on the next upgrade [16:23:39] (03CR) 10EBernhardson: [C: 03+2] Bump experimental-highlighter to 6.5.4.1 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/543188 (https://phabricator.wikimedia.org/T236123) (owner: 10DCausse) [16:23:48] there are a couple of changes to defaults that seem harmless [16:24:11] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:24:17] (03PS7) 10CDanis: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans) [16:24:20] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) p:05Triage→03High [16:24:30] cdanis: I was going to restart a few nodes as canaries immediately, holler for a revert if there are issues, and rolling restart later if things look good [16:24:38] urandom: sounds good [16:25:17] 10Operations, 10DBA, 
10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:25:28] (03CR) 10CDanis: [C: 03+2] cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) (owner: 10Eevans) [16:25:46] (03PS2) 10MarcoAurelio: (WIP) Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) [16:25:51] urandom: puppet-merged [16:26:27] (03PS3) 10MarcoAurelio: Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) [16:26:45] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10akosiaris) [16:27:16] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) [16:28:13] !log depool cp3035 (cache_upload) T233242 [16:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:20] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [16:28:27] cdanis: awesome; thanks! [16:29:06] 10Operations, 10DBA, 10serviceops, 10Goal: Switchover backup director service from helium to backup1001 - https://phabricator.wikimedia.org/T236406 (10jcrespo) So because of buster clients and jessie storage daemons cannot talk to each other, we will have to alter slightly the upgrade strategy. Several opt... 
[16:30:26] (03PS1) 10MarcoAurelio: (WIP) mediawiki::web:prod_sites.pp: Apache config for ka.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) [16:32:02] (03PS2) 10Krinkle: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie) [16:32:34] (03PS3) 10Krinkle: RejectParserCacheValue to reject possibly-corrupted entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie) [16:32:44] (03CR) 10Krinkle: "End date still TBD as it depends on when memc is purged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545647 (https://phabricator.wikimedia.org/T235188) (owner: 10Anomie) [16:32:57] !log restarting cassandra, restbase1016 (canary for config changes) -- T200803 [16:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:02] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [16:34:15] (03PS3) 10Mobrovac: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) [16:34:37] (03PS2) 10MarcoAurelio: mediawiki::web:prod_sites.pp: Apache config for ka.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) [16:35:43] (03CR) 10Jcrespo: "Yes, I have yet to apply a couple of fixes that both Filippo and Alex mentioned before merging." [puppet] - 10https://gerrit.wikimedia.org/r/544220 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [16:37:50] cdanis: hi. I wanted to test a puppet change with PCC but I'm not sure which host(s) I should choose. Could you help me? [16:38:09] hauskater: sure, link me the change?
[16:38:29] cdanis: thank you, it's https://gerrit.wikimedia.org/r/545889 [16:38:49] I looked in site.pp for Apache but found nix [16:39:05] (03CR) 10Subramanya Sastry: Parsoid/PHP: Load the extension on all Parsoid nodes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [16:39:08] and I'd rather not have the change tested on all hosts, that takes lots of time [16:39:12] !log restarting cassandra, restbase2011 (canary for config changes) -- T200803 [16:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:17] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [16:40:24] hauskater: ah, yeah, Apache is not referenced as a role, it's just included from other configs. you should be looking at appservers for that one, I think -- pick a couple machines that are mediawiki::appserver, and a couple mediawiki::appserver::api [16:40:47] cdanis: alright, I'll pick some of those [16:42:15] cdanis: mwdebug won't do the trick right? 
[16:42:23] mwdebug will work as well [16:42:32] I'll test that then [16:44:53] (03PS3) 10MarcoAurelio: mediawiki::web:prod_sites.pp: Apache config for ka.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) [16:45:02] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [16:45:08] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10fdans) p:05Triage→03High [16:45:43] (03PS1) 10BBlack: Add dhcp macaddrs for esams rack 16 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545893 (https://phabricator.wikimedia.org/T236294) [16:48:04] (03CR) 10BBlack: [C: 03+2] Add dhcp macaddrs for esams rack 16 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545893 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [16:48:33] cdanis: that worked: https://puppet-compiler.wmflabs.org/compiler1001/321/ :) [16:54:14] !log depool cp3036 (cache_upload) T233242 [16:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:20] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [16:55:04] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3007.esams.wmnet'] ` The log can be found in `/var/log/wmf-au... [16:55:17] 04Critical Alert for device asw2-esams.mgmt.esams.wmnet - Juniper alarm active [16:55:32] I wish that alert was more-informative [16:56:37] XioNoX: ^ [16:56:46] "It's broken, please fix" [17:00:04] cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1700). [17:00:41] no parsoid deploy today [17:01:37] bblack: that's a known one, asw2 doesn't have a 2nd power connected [17:02:05] (03CR) 10Subramanya Sastry: Parsoid/PHP: Load the extension on all Parsoid nodes (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [17:02:21] i can't just mute that alarm without muting everything on the device :( [17:02:25] ok [17:04:41] 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10Dzahn) [17:05:58] (03PS1) 10BBlack: esams mgmt dns for rack 14 [dns] - 10https://gerrit.wikimedia.org/r/545895 (https://phabricator.wikimedia.org/T236294) [17:06:11] (03PS5) 10Aaron Schulz: Set "allow_tcp_nagle_delay" to false in mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 [17:06:28] 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` bast3004.wikimedia.org ` The log can be found in `/var/log/wmf-auto-reimage/201910241705_dzahn_24181_bast3004_wikimedia_org.log`.
[17:06:30] 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['bast3004.wikimedia.org'] ` Of which those **FAILED**: ` ['bast3004.wikimedia.org'] ` [17:08:06] (03PS2) 10BBlack: esams mgmt dns for rack 14 [dns] - 10https://gerrit.wikimedia.org/r/545895 (https://phabricator.wikimedia.org/T236294) [17:08:17] (03CR) 10WMDE-leszek: Enable Wikibase client access on testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545862 (https://phabricator.wikimedia.org/T223792) (owner: 10WMDE-leszek) [17:09:10] (03CR) 10BBlack: [C: 03+2] esams mgmt dns for rack 14 [dns] - 10https://gerrit.wikimedia.org/r/545895 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [17:09:32] 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` bast3004.wikimedia.org ` The log can be found in `/var/log/wmf-auto-reimage/201910241709_dzahn_24962_bast3004_wikimedia_org.log`. [17:10:34] (03PS4) 10Mobrovac: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) [17:12:52] (03CR) 10Subramanya Sastry: [C: 03+1] "Ok, lets go with this ... no need to be ultra paranoid about the linter flag in scenarios where we get $wgReadOnly set to false .. 
we will" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [17:15:57] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [17:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:37] (03CR) 10Mobrovac: [C: 03+2] Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [17:18:48] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:36] (03Merged) 10jenkins-bot: Parsoid/PHP: Load the extension on all Parsoid nodes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544878 (https://phabricator.wikimedia.org/T235898) (owner: 10Mobrovac) [17:24:18] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/545289 (https://phabricator.wikimedia.org/T162070) (owner: 10Jbond) [17:26:00] !log mobrovac@deploy1001 Synchronized wmf-config/CommonSettings.php: Enable Parsoid/PHP in the whole wtp (a.k.a. Parsoid) cluster - T236388 (duration: 00m 53s) [17:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:05] T236388: Linting is disabled on beta cluster, but needs to be enabled - https://phabricator.wikimedia.org/T236388 [17:27:45] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] ` and were **ALL** successful. 
[17:29:27] !log asw2-esams - committing switch port/vlan config for new rack 14 hosts [17:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:25] (03PS1) 10Matthias Mullie: Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) [17:35:06] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) merging in duplicate ticket T236409 where i started OS install [17:35:32] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [17:35:33] 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10Dzahn) [17:35:55] (03CR) 10WMDE-leszek: [C: 03+1] Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie) [17:36:38] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [17:37:25] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) confirmed mgmt and production DNS exist, mgmt password is set, IPMI over LAN working..
started OS install [17:38:38] !log pool cp3059 (cache_upload) T233242 [17:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:43] T233242: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 [17:39:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:00] 10Operations: setup bast3004 - https://phabricator.wikimedia.org/T236409 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['bast3004.wikimedia.org'] ` and were **ALL** successful. [17:58:13] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ` [bast3004:~] $ gen_fingerprints +---------+---------+-----------------------------------------------------+ | Cipher | Algo | Fingerprint |... [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T1800). [18:00:04] awight, urandom, and matthiasmullie: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] * addshore is watching [18:00:16] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [18:00:25] Present :) [18:00:28] PROBLEM - Host ps1-b4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [18:00:28] I can SWAT today! [18:00:34] o/ [18:00:50] Urbanecm: thank you! [18:01:05] awight: is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/545260 a beta-only patch? 
[18:01:24] Urbanecm: No, it's a beta feature for production. [18:01:34] ok, a different beta then :) [18:01:45] We should call the beta cluster the "alpha" cluster or something :-) [18:01:54] (03PS2) 10Urbanecm: Reference Previews: full beta deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545260 (https://phabricator.wikimedia.org/T235083) (owner: 10Awight) [18:02:09] well, even our prod wikis run on official alpha versions of MW :) [18:02:14] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545260 (https://phabricator.wikimedia.org/T235083) (owner: 10Awight) [18:02:18] /o\ good point [18:02:33] o/ [18:02:40] hi urandom [18:02:46] your patch will follow after awight's [18:03:03] Urbanecm: k [18:03:03] (03PS1) 10BBlack: esams: macaddrs for all new rack 14 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545908 (https://phabricator.wikimedia.org/T236294) [18:03:07] (03Merged) 10jenkins-bot: Reference Previews: full beta deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545260 (https://phabricator.wikimedia.org/T235083) (owner: 10Awight) [18:03:30] awight: could you check your patch at mwdebug1001, please? [18:03:38] Urbanecm: checking... [18:03:39] !log setting ip info for ps1-a6-eqiad, it is rebooting. T227142 [18:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:44] T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 [18:04:18] (03PS6) 10Volans: metamonitoring: add sync of Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) [18:04:40] Urbanecm: It's healthy, ready for deployment.
[18:04:40] (03CR) 10BBlack: [C: 03+2] esams: macaddrs for all new rack 14 hosts [puppet] - 10https://gerrit.wikimedia.org/r/545908 (https://phabricator.wikimedia.org/T236294) (owner: 10BBlack) [18:04:57] (03PS4) 10Urbanecm: rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [18:05:07] good, syncing [18:05:13] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [18:05:26] RECOVERY - Host ps1-a6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.13 ms [18:05:43] (03PS1) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [18:05:56] (03Merged) 10jenkins-bot: rename service definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/544199 (https://phabricator.wikimedia.org/T222851) (owner: 10Eevans) [18:05:59] (03CR) 10Volans: "Thanks for the review, replies inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [18:06:17] (03PS2) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 [18:06:22] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [18:06:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: b20d6de: Reference Previews: full beta deployment (T235083) (duration: 00m 52s) [18:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:36] T235083: Full Beta for the ReferencePreviews feature - https://phabricator.wikimedia.org/T235083 [18:06:39] hauskater: thanks for doing the initial config thingie! [18:06:52] I was bored and had a bit of time :) [18:07:12] urandom: if possible, please test your patch at mwdebug1001 [18:07:41] (03PS2) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [18:07:57] (03CR) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. 
(033 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 (owner: 10RLazarus) [18:08:02] PROBLEM - ps1-a6-eqiad-infeed-load-tower-A-phase-X on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:08:20] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [18:08:21] Urbanecm: hrmm, I'm not sure I know how to do that [18:08:41] 🤔 [18:08:48] PROBLEM - ps1-a6-eqiad-infeed-load-tower-B-phase-Z on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:11] urandom: I'd simply use something that needs the service to ensure it doesn't fail [18:09:18] PROBLEM - ps1-a6-eqiad-infeed-load-tower-A-phase-Z on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:18] PROBLEM - ps1-a6-eqiad-infeed-load-tower-B-phase-X on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:18] PROBLEM - ps1-a6-eqiad-infeed-load-tower-A-phase-Y on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:24] PROBLEM - ps1-a6-eqiad-infeed-load-tower-B-phase-Y on ps1-a6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:29] or I can just sync and verify that in fatalmonitor [18:09:30] Urbanecm: can testwiki be accessed there? 
[18:09:36] (03PS3) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [18:09:53] urandom: yes, it's a staging production server, see https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug for the manual [18:10:37] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [18:11:21] (03PS4) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [18:11:26] 10Operations, 10Wikimedia-production-error: mwdebug1001 throws "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" - https://phabricator.wikimedia.org/T236401 (10Krinkle) [18:11:32] (03CR) 10CDanis: [C: 03+1] metamonitoring: add sync of Icinga contacts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/545574 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [18:11:41] always a comma [18:11:50] (03PS3) 10RLazarus: Initial version of httpbb, the HTTP black box testing tool. 
[software/httpbb] - 10https://gerrit.wikimedia.org/r/545689 [18:12:13] 10Operations, 10Wikimedia-production-error: mwdebug1001 throws "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" - https://phabricator.wikimedia.org/T236401 (10Krinkle) [18:12:27] (03PS5) 10MarcoAurelio: Initial configuration for ka.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) [18:12:39] 10Operations, 10Wikimedia-production-error: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (10Krinkle) [18:13:10] Urbanecm: perfect [18:13:19] 10Operations, 10Wikimedia-production-error: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (10Krinkle) Does not affect production (debug server) and does not seem to affect MediaWiki error logging, either. I... [18:13:43] Urbanecm: looks good [18:13:55] urandom: syncing then [18:15:35] !log urbanecm@deploy1001 Synchronized wmf-config/: SWAT: 84c48df: rename service definition (T222851) (duration: 00m 53s) [18:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:42] T222851: Improve Echo seentime code for multi-DC access - https://phabricator.wikimedia.org/T222851 [18:15:45] urandom: synced [18:15:57] (03PS2) 10Urbanecm: Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie) [18:16:16] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie) [18:16:29] matthiasmullie: +2'ed your patch, waiting for CI [18:16:34] Urbanecm: thanks! 
[18:16:39] yw urandom [18:16:46] Urbanecm: cool thanks [18:17:03] (03Merged) 10jenkins-bot: Enable Wikibase client access on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545903 (https://phabricator.wikimedia.org/T223792) (owner: 10Matthias Mullie) [18:17:40] matthiasmullie: your patch is at mwdebug1001, could you test please? [18:18:00] sure [18:18:05] thanks [18:20:04] !log ps1-a6-eqiad setup complete, icinga errors should clear up T227142 [18:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:08] T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 [18:20:24] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10RobH) 05Open→03Resolved [18:20:26] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [18:21:02] RECOVERY - ps1-a6-eqiad-infeed-load-tower-B-phase-X on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-B-phase-X 393 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:02] RECOVERY - ps1-a6-eqiad-infeed-load-tower-A-phase-Z on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-A-phase-Z 412 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:02] RECOVERY - ps1-a6-eqiad-infeed-load-tower-A-phase-Y on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-A-phase-Y 367 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:04] RECOVERY - ps1-a6-eqiad-infeed-load-tower-B-phase-Z on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-B-phase-Z 339 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:04] RECOVERY - ps1-a6-eqiad-infeed-load-tower-B-phase-Y on ps1-a6-eqiad is OK: SNMP OK - 
ps1-a6-eqiad-infeed-load-tower-B-phase-Y 414 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:38] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) a:05Cmjohnson→03RobH [18:21:56] PROBLEM - IPMI Sensor Status on lvs3007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [18:22:05] !log completing ps1-b6-eqiad setup, pdu will reboot twice, power output unaffected T227540 [18:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:09] T227540: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 [18:23:02] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [18:23:04] 10Operations, 10DC-Ops, 10decommission: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) [18:23:06] There goes one of the restarts [18:23:15] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['lvs3005.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20191024182... [18:23:19] lets see how quickly librenms detects, i assume after its back online [18:23:34] (03PS1) 10Dzahn: site: replace bast3002 with bast3004 [puppet] - 10https://gerrit.wikimedia.org/r/545911 (https://phabricator.wikimedia.org/T236329) [18:23:56] Urbanecm: well... 
it's not working, but at least nothing else seems broken ^^ [18:24:20] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['dns3001.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-reimage/2... [18:24:29] matthiasmullie: I'm stupid, I didn't actually pull it :( [18:24:47] matthiasmullie: could you try once more please? [18:24:48] haha :D [18:25:11] sure :) [18:25:14] thanks [18:25:21] !log restbase cassandra rolling restart, rack 'a' -- T200803 [18:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:27] (03CR) 10Nuria: [C: 04-1] "Got it, thank you." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [18:25:29] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [18:25:45] Urbanecm: seems to work now :) give me 2 min to verify nothing else broke! [18:25:51] sure [18:25:52] (03PS1) 10BBlack: Add dns3001 to ntp peers list [puppet] - 10https://gerrit.wikimedia.org/r/545912 (https://phabricator.wikimedia.org/T236217) [18:26:22] RECOVERY - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [18:26:31] (03CR) 10BBlack: [C: 03+2] Add dns3001 to ntp peers list [puppet] - 10https://gerrit.wikimedia.org/r/545912 (https://phabricator.wikimedia.org/T236217) (owner: 10BBlack) [18:28:15] 04Critical Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - Device rebooted [18:28:35] Urbanecm: LGTM - let's go :) [18:28:38] RECOVERY - Host ps1-b4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [18:28:42] matthiasmullie: syncing [18:28:44] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) a:03Dzahn @fgiunchedi ACK, so no data needs to be copied from bast3002 to bast3004, the new bastion? Instead it moves to a VM? [18:28:46] RECOVERY - ps1-a6-eqiad-infeed-load-tower-A-phase-X on ps1-a6-eqiad is OK: SNMP OK - ps1-a6-eqiad-infeed-load-tower-A-phase-X 378 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:29:42] PROBLEM - ps1-b4-eqiad-infeed-load-tower-A-phase-Z on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:29:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 263fd0f: Enable Wikibase client access on commonswiki (T223792) (duration: 00m 52s) [18:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:03] T223792: Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items - https://phabricator.wikimedia.org/T223792 [18:30:37] matthiasmullie: done! [18:30:40] !log cr2-esams: add dns3001 to anycast4 neighbors [18:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:48] Urbanecm: thanks! 
[18:30:51] yw [18:31:05] !log cr3-esams: add dns3001 to anycast4 neighbors [18:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:16] PROBLEM - ps1-b4-eqiad-infeed-load-tower-A-phase-Y on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:31:20] PROBLEM - ps1-b4-eqiad-infeed-load-tower-B-phase-Y on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:31:36] PROBLEM - ps1-b4-eqiad-infeed-load-tower-A-phase-X on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:31:55] ugh the pdu alerts are noisy [18:31:58] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [18:32:06] PROBLEM - ps1-b4-eqiad-infeed-load-tower-B-phase-X on ps1-b4-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:34:00] matthiasmullie: I saw a spike of errors like https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2019.10.24/mediawiki/?id=AW3_C37YghP2xm4v6HnC [18:34:02] please have a look [18:36:00] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3061.esams.wmnet', 'cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3064.e... 
[18:36:03] checking [18:36:03] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3053.e... [18:36:21] ongoing? or 1 temporary spike [18:36:28] (03PS1) 10RobH: setting sentry4 for ps1-b4-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/545915 (https://phabricator.wikimedia.org/T227540) [18:36:42] matthiasmullie: see https://logstash.wikimedia.org/goto/10c278413af17751ff822d2388bb8a2a [18:36:48] (03PS1) 10Mobrovac: RESTRouter: Add ka.wm.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) [18:37:12] it seems to be ongoing [18:37:34] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: Add ka.wm.org [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac) [18:38:22] seems to be ongoing, yes [18:38:22] mobrovac: ka or kl ? 
[18:38:26] (03PS3) 10Mathew.onipe: wdqs: Use a DRYer approach to check selected hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 [18:38:28] (03PS12) 10Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) [18:38:49] (03CR) 10RobH: [C: 03+2] setting sentry4 for ps1-b4-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/545915 (https://phabricator.wikimedia.org/T227540) (owner: 10RobH) [18:38:59] matthiasmullie: by looking at today version of same search, it seems it can well be just a coincidence - there was similar spike at around 10am [18:39:13] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn) [18:39:17] also odd: it worked on mwdebug1001 (and still does, also on mwdebug2001) - it doesn't work on whatever random other server I'm on [18:39:38] matthiasmullie: then it's probably cache [18:40:00] try refreshing sev times with ctrl+shift+r or from inkognito window [18:40:01] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn) [18:40:18] (03CR) 10BBlack: RESTRouter: Add ka.wm.org (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac) [18:40:50] Urbanecm: you ok with giving it 10 more minutes or so, and reverting if things don't settle down then? [18:41:00] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3001.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241840_dzahn_4514... 
[18:41:02] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3001.esams.wmnet'] ` Of which those **FAILED**: ` ['ganeti3001.esams.wmnet'] ` [18:41:04] matthiasmullie: yup [18:41:09] (or if errors start spiking more) [18:41:25] cool, fingers crossed [18:41:31] (03PS4) 10Eevans: [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [18:41:33] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3001.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241841_dzahn_4528... [18:42:05] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:42:05] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:42:07] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) [18:42:08] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:11] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [18:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:21] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [18:42:23] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - 
https://phabricator.wikimedia.org/T227540 (10RobH) 05Open→03Resolved All changes merged, when puppet runs on icinga it'll clear the alerts. [18:42:25] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3002.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241842_dzahn_4545... [18:42:44] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) a:05RobH→03None [18:43:00] (03CR) 10Mathew.onipe: wdqs: add data-reload cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: 10Mathew.onipe) [18:43:15] 04Critical Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - Device rebooted [18:43:24] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ganeti3003.esams.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201910241843_dzahn_4563... 
[18:43:26] (03CR) 10Mathew.onipe: wdqs: Use a DRYer approach to check selected hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/545673 (owner: 10Mathew.onipe) [18:43:34] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [18:43:37] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10RobH) 05Open→03Resolved [18:43:46] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) a:05RobH→03None [18:44:11] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC) - https://phabricator.wikimedia.org/T227142 (10RobH) a:05RobH→03None [18:44:12] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:25] 10Operations, 10ops-eqiad, 10DC-Ops: a8-eqiad pdu refresh (Thursday 10/17 @11am UTC) - https://phabricator.wikimedia.org/T227133 (10RobH) a:05RobH→03None [18:44:38] RECOVERY - ps1-b4-eqiad-infeed-load-tower-A-phase-Z on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-A-phase-Z 367 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:44:52] 10Operations, 10ops-eqiad, 10DC-Ops: fix serial connection for ps1-a2-eqiad - https://phabricator.wikimedia.org/T235190 (10RobH) 05Open→03Resolved not sure how it was fixed, but it was fixed since we now setup the pdu. 
[18:44:54] 10Operations, 10ops-eqiad, 10DC-Ops: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC) - https://phabricator.wikimedia.org/T227138 (10RobH) [18:45:02] RECOVERY - ps1-b4-eqiad-infeed-load-tower-A-phase-Y on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-A-phase-Y 486 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:45:02] (03PS1) 10Mobrovac: RESTRouter: s/klwm/kawm/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/545918 [18:45:06] RECOVERY - ps1-b4-eqiad-infeed-load-tower-B-phase-Y on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-B-phase-Y 489 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:45:10] 10Operations, 10ops-eqiad, 10DC-Ops: update puppet for new PDU models - https://phabricator.wikimedia.org/T233129 (10RobH) a:05RobH→03None [18:45:24] RECOVERY - ps1-b4-eqiad-infeed-load-tower-A-phase-X on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-A-phase-X 315 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:45:33] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: s/klwm/kawm/ [deployment-charts] - 10https://gerrit.wikimedia.org/r/545918 (owner: 10Mobrovac) [18:46:08] !log restbase cassandra rolling restart, rack 'b' -- T200803 [18:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:13] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [18:46:18] (03CR) 10Mobrovac: [V: 03+2 C: 03+2] RESTRouter: Add ka.wm.org (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac) [18:46:30] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway) [18:46:41] 10Operations, 10Machine vision, 
10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway) p:05Triage→03High [18:47:25] 10Operations, 10Machine vision, 10Product-Infrastructure-Team-Backlog (Kanban): Configure Google Cloud Vision credentials in production - https://phabricator.wikimedia.org/T236426 (10Mholloway) [18:49:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device ps1-b4-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted [18:49:54] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10colewhite) This issue is mitigated as of this UTC morning and confirm I am no longer seeing long delays of list email. [18:50:08] PROBLEM - Host 2620:0:862:1:b226:28ff:fe6e:cfe0 is DOWN: PING CRITICAL - Packet loss = 100% [18:51:03] hmmm [18:51:16] did we forget ipv6 for one of these? :) [18:52:45] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3005.esams.wmnet'] ` and were **ALL** successful. [18:53:02] Urbanecm: I'm still getting varying results based on what server serves the request [18:53:06] but the errors seems to have stopped [18:53:17] or it's a temporary bad icinga entry from icinga-ization preceding mapped-v6 or something [18:53:21] so they seem to have been unrelated or resolved [18:53:31] matthiasmullie: indeed [18:53:52] by what server serves the request you mean debug/ordinary, or even different results at ordinary servers? [18:53:59] and have you tried in inkognito window? 
[18:54:29] some ordinary servers serve the expected result, some don't [18:54:49] the code is there, though - I checked all [18:54:57] interesting [18:55:02] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: Lengthy delays in emails being received from mailing lists in October 2019 - https://phabricator.wikimedia.org/T235983 (10ssingh) >>! In T235983#5603254, @ssingh wrote: > This bug is also affecting the https://lists.wikimedia.org/mailman/listinfo/traffic-anoma... [18:55:21] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:55:22] !log Morning SWAT done [18:55:22] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:55:22] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:55:22] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:55:23] yeah that IPv6 host-down is from dns3001, some kind of race between configuration of proper ipv6 and icinga-ization during initial install. it will fix itself later I think [18:55:24] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:55:24] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:55:24] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:28] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:29] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:55:31] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:56] 10Operations, 10ops-esams, 10Traffic: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [18:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:23] Urbanecm: I'll keep an eye on things, but we can probably leave the patch up [18:56:24] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:28] thanks! [18:56:31] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:56:32] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [18:56:33] matthiasmullie: ack, and you're welcome [18:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:29] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:28] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:30] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:00:31] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:25] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:01:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:31] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:01:33] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:01:33] PROBLEM - rsyslog in esams is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_wezen.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=esams+prometheus/ops [19:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:14] (03PS1) 10Dzahn: smokeping: replace bast3002 with bast3004 as target [puppet] - 10https://gerrit.wikimedia.org/r/545921 (https://phabricator.wikimedia.org/T236394) [19:02:40] (03CR) 10jerkins-bot: [V: 04-1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:03:21] PROBLEM - Recursive DNS on 91.198.174.61 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS [19:03:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:37] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) @fgiunchedi Is there already a ticket for setting up those new prometheus VMs? I see it needs some related changes in puppet, so i'll block this ticket on that. 
[19:05:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:05:26] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [19:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:38] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:59] !log restbase cassandra rolling restart, rack 'd' -- T200803 [19:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:03] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [19:06:24] (03PS1) 10Dzahn: hieradata/common: add bast3004 to bastion hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/545922 (https://phabricator.wikimedia.org/T236394) [19:06:26] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Archiva relies on a tmpfs directory that is wiped after each reboot - https://phabricator.wikimedia.org/T214366 (10Nuria) 05Open→03Resolved [19:07:41] PROBLEM - Host cp3052 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:41] PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:41] PROBLEM - Host cp3050 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:49] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:11] (03PS6) 10Eevans: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:08:53] (03CR) 10jerkins-bot: [V: 04-1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:08:58] RECOVERY - Host cp3053 
is UP: PING OK - Packet loss = 0%, RTA = 83.41 ms [19:08:58] RECOVERY - Host cp3050 is UP: PING OK - Packet loss = 0%, RTA = 83.38 ms [19:08:58] RECOVERY - Host cp3052 is UP: PING OK - Packet loss = 0%, RTA = 83.52 ms [19:08:58] RECOVERY - ps1-b4-eqiad-infeed-load-tower-B-phase-X on ps1-b4-eqiad is OK: SNMP OK - ps1-b4-eqiad-infeed-load-tower-B-phase-X 302 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:59] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (10Nuria) 05Open→03Resolved [19:09:34] PROBLEM - Host ganeti3001 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:52] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:09:54] well.. those are all failures to set downtime by the script i suppose [19:10:00] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:06] (return code 99 further above) [19:10:13] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3061 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:13] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp3050 is CRITICAL: connect to address 10.20.0.50 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:14] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3053 is CRITICAL: connect to address 10.20.0.53 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:14] PROBLEM - check_trafficserver_backend_config_status on cp3061 is CRITICAL: NRPE: Command
check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:14] PROBLEM - Ensure traffic_server is running for instance tls on cp3063 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:14] PROBLEM - Ensure traffic_server is running for instance tls on cp3064 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:14] PROBLEM - Ensure traffic_manager is running for instance tls on cp3065 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:10:14] PROBLEM - Check systemd state on cp3065 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:34] PROBLEM - HTTPS Unified RSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:10:48] PROBLEM - Varnish HTTP text-frontend - port 80 on cp3050 is CRITICAL: connect to address 10.20.0.50 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:10:54] ^ installation in progress but downtime should have been set automatically by wmf-reimage [19:11:36] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: connect to address 10.20.0.52 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:38] PROBLEM - Ensure traffic_manager is running for instance backend on cp3051 is CRITICAL: NRPE: Command check_traffic_manager_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:38] PROBLEM - Ensure traffic_manager is running for instance tls on cp3050 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined 
https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:38] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:11:38] PROBLEM - eventlogging Varnishkafka log producer on cp3054 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:11:38] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp3063 is CRITICAL: NRPE: Command check_trafficserver_exporter_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:40] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3064 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:40] PROBLEM - Ensure traffic_server is running for instance backend on cp3065 is CRITICAL: NRPE: Command check_traffic_server_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:11:40] PROBLEM - Webrequests Varnishkafka log producer on cp3065 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:11:40] RECOVERY - Host ganeti3001 is UP: PING OK - Packet loss = 0%, RTA = 83.37 ms [19:12:14] RECOVERY - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.1 200 Ok - 29969 bytes in 0.427 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:16] PROBLEM - Ensure traffic_manager is running for instance tls on cp3051 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:16] PROBLEM - Ensure traffic_server is running for instance tls on cp3050 is CRITICAL: NRPE: Command 
check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:16] PROBLEM - Check systemd state on cp3051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:18] PROBLEM - check_trafficserver_backend_config_status on cp3063 is CRITICAL: NRPE: Command check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:18] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3063 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:18] PROBLEM - Ensure traffic_server is running for instance tls on cp3065 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:18] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3062 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:13:18] hmmm downtimes seem unreliable, sorry for the noise [19:13:42] PROBLEM - Check systemd state on cp3064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:42] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3001.esams.wmnet'] ` and were **ALL** successful. 
[19:13:50] RECOVERY - Varnish HTTP text-frontend - port 80 on cp3050 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [19:13:58] PROBLEM - Host cp3054 is DOWN: PING CRITICAL - Packet loss = 100% [19:15:16] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3050 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:16] PROBLEM - Ensure traffic_server is running for instance backend on cp3051 is CRITICAL: NRPE: Command check_traffic_server_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:16] PROBLEM - Ensure traffic_manager is running for instance tls on cp3052 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:16] PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:16] PROBLEM - Webrequests Varnishkafka log producer on cp3051 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:15:16] PROBLEM - Ensure traffic_manager is running for instance backend on cp3053 is CRITICAL: NRPE: Command check_traffic_manager_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:17] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3061 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:18] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp3065 is CRITICAL: NRPE: Command check_trafficserver_exporter_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:15:39] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3003.esams.wmnet'] ` and were **ALL** successful. [19:15:55] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3002.esams.wmnet'] ` and were **ALL** successful. 
[19:17:24] PROBLEM - Ensure traffic_server is running for instance tls on cp3051 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:17:24] PROBLEM - Ensure traffic_server is running for instance tls on cp3052 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:17:24] PROBLEM - Ensure traffic_manager is running for instance tls on cp3053 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:17:24] PROBLEM - Check systemd state on cp3053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:26] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3065 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:17:26] PROBLEM - check_trafficserver_backend_config_status on cp3065 is CRITICAL: NRPE: Command check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:17:26] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3064 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:18:26] RECOVERY - Ensure traffic_manager is running for instance tls on cp3053 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:18:26] RECOVERY - Check systemd state on cp3053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:46] RECOVERY - Ensure traffic_manager is running for instance backend on cp3053 is OK: 
PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:19:00] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3053 is OK: HTTP OK: HTTP/1.0 200 OK - 19414 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:20:44] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3051 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:20:44] PROBLEM - check_trafficserver_backend_config_status on cp3051 is CRITICAL: NRPE: Command check_check_trafficserver_backend_config_status not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:20:46] PROBLEM - eventlogging Varnishkafka log producer on cp3062 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:20:46] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3050 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:20:56] PROBLEM - Check systemd state on cp3052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:16] RECOVERY - Ensure traffic_server is running for instance tls on cp3063 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:21:18] RECOVERY - check_trafficserver_backend_config_status on cp3063 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:21:18] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3063 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:21:28] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp3063 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:21:56] RECOVERY - eventlogging Varnishkafka log producer on cp3062 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:22:26] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3061 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:22:28] RECOVERY - check_trafficserver_backend_config_status on cp3061 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:22:32] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3062 
is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:22:54] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3065 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:23:00] jouncebot: now [19:23:00] No deployments scheduled for the next 3 hour(s) and 36 minute(s) [19:23:04] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3052 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:23:04] bblack: are you in the middle of something? I wanted to do a quick gerrit restart: would that interfere with what you're doing? [19:23:26] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3061 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:23:46] RECOVERY - Ensure traffic_server is running for instance tls on cp3050 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:24:00] RECOVERY - Ensure traffic_manager is running for instance tls on cp3050 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:24:30] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3050 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:24:40] bblack: ack, i had it too. 
(exit_code=99) when it tries to set downtimes [19:24:42] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3050 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:24:44] RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:02] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp3052 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:25:04] PROBLEM - eventlogging Varnishkafka log producer on cp3064 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:25:08] RECOVERY - Check systemd state on cp3051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:25:08] RECOVERY - Ensure traffic_manager is running for instance tls on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:25:20] RECOVERY - Ensure traffic_manager is running for instance backend on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:25:36] RECOVERY - Ensure traffic_server is running for instance tls on cp3051 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:25:40] RECOVERY - Ensure trafficserver_exporter is running for instance tls on 
cp3052 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:25:48] RECOVERY - check_trafficserver_backend_config_status on cp3051 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:25:48] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:02] RECOVERY - Ensure traffic_server is running for instance backend on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:02] RECOVERY - Ensure traffic_manager is running for instance tls on cp3052 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:02] RECOVERY - Webrequests Varnishkafka log producer on cp3051 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:26:02] RECOVERY - Check systemd state on cp3052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:20] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3052 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:36] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 
19440 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:26:37] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn) [19:26:58] RECOVERY - Ensure traffic_server is running for instance tls on cp3052 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:27:02] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3065 is CRITICAL: connect to address 10.20.0.65 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:28:01] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn) - OS installed - in puppet with spare role - set to "staged" in netbox Are all previous boxes already done? If so then it can be handed over to Filippo i think. [19:28:04] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:28:06] RECOVERY - Check systemd state on cp3064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:12] RECOVERY - Recursive DNS on 91.198.174.61 is OK: DNS OK: 0.102 seconds response time. 
www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS [19:28:20] PROBLEM - IPMI Sensor Status on cp3051 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:28:28] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp3064 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:29:14] RECOVERY - eventlogging Varnishkafka log producer on cp3064 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:29:16] RECOVERY - Ensure traffic_server is running for instance tls on cp3064 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:30:07] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dns3001.wikimedia.org'] ` and were **ALL** successful. [19:30:11] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [19:30:14] RECOVERY - rsyslog in esams is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=esams+prometheus/ops [19:30:58] (03CR) 10BBlack: [C: 03+1] hieradata/common: add bast3004 to bastion hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/545922 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [19:31:02] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 667 days) https://wikitech.wikimedia.org/wiki/Logs [19:31:19] (03CR) 10Dzahn: [C: 03+2] hieradata/common: add bast3004 to bastion hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/545922 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [19:31:25] !log restbase cassandra rolling restart, codfw / rack 'b' -- T200803 [19:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:31] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [19:31:50] PROBLEM - IPMI Sensor Status on cp3053 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:32:17] !log reboot dns3001 [19:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:51] (03PS7) 10Eevans: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:33:10] PROBLEM - Ensure traffic_manager is running for instance backend on cp3065 is CRITICAL: NRPE: Command check_traffic_manager_backend not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:34:04] (03CR) 10jerkins-bot: [V: 04-1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 
(https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:34:18] PROBLEM - IPMI Sensor Status on ganeti3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:35:14] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:35:38] RECOVERY - Ensure traffic_manager is running for instance tls on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:35:38] RECOVERY - Check systemd state on cp3065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:40] RECOVERY - Ensure traffic_server is running for instance tls on cp3065 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:35:56] RECOVERY - Ensure traffic_server is running for instance backend on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/traffic_server -M --httpport 3128 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:35:58] RECOVERY - Webrequests Varnishkafka log producer on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:36:08] PROBLEM - Host 2620:0:862:1:b226:28ff:fe6e:cfe0 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:1:b226:28ff:fe6e:cfe0) [19:36:10] RECOVERY - 
check_trafficserver_log_fifo_tls_tls on cp3065 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:36:13] (03PS8) 10Eevans: Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [19:36:16] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3062.esams.wmnet', 'cp3063.esams.wmnet', 'cp3061.esams.wmnet', 'cp3064.esams.wmnet', 'cp3065.esams.wmnet'] ` and were **A... [19:36:18] RECOVERY - check_trafficserver_backend_config_status on cp3065 is OK: OK: configuration is current https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:36:18] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:36:22] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3065 is OK: HTTP OK: HTTP/1.0 200 OK - 19417 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:36:30] RECOVERY - Ensure traffic_manager is running for instance backend on cp3065 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:39:40] PROBLEM - IPMI Sensor Status on cp3062 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:41:30] PROBLEM - Check whether ferm is 
active by checking the default input chain on ms-be1045 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:41:40] PROBLEM - Check systemd state on ms-be1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:40] PROBLEM - IPMI Sensor Status on cp3061 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:41:56] PROBLEM - IPMI Sensor Status on ganeti3003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:43:04] PROBLEM - IPMI Sensor Status on lvs3005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:43:40] PROBLEM - IPMI Sensor Status on cp3064 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:45:38] PROBLEM - IPMI Sensor Status on cp3063 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:46:03] ^ known .. 
silencing the PDU part of it [19:46:49] (03PS2) 10CDanis: swift alerts: check over https when appropriate [puppet] - 10https://gerrit.wikimedia.org/r/545874 [19:47:03] PROBLEM - IPMI Sensor Status on cp3050 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:22] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3050 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:22] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3051 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:22] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3052 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:22] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3053 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:22] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3061 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - 
only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:22] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3062 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:22] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3063 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:23] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3064 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:23] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3065 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:24] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:24] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti3003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = 
Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:25] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs3005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:47:25] ACKNOWLEDGEMENT - IPMI Sensor Status on lvs3007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams work in progress - only one PDU so far https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:48:24] one last spam, but it won't repeat now until the next change, and then we'll see those once we've got the second power supply [19:49:22] I'm still getting the cps through their bringup stuff, not expecting them to be alert-free yet [19:50:03] (03CR) 10CDanis: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19051/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/545874 (owner: 10CDanis) [19:50:21] RECOVERY - Host cp3054 is UP: PING OK - Packet loss = 0%, RTA = 83.51 ms [19:52:05] bblack: do you want to see the alerts or should I downtime?
[19:52:07] PROBLEM - traffic-pool service on cp3054 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:52:33] PROBLEM - eventlogging Varnishkafka log producer on cp3054 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:52:39] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:52:41] PROBLEM - Ensure trafficserver_exporter is running for instance tls on cp3054 is CRITICAL: NRPE: Command check_trafficserver_exporter_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:52:41] PROBLEM - Ensure traffic_server is running for instance tls on cp3054 is CRITICAL: NRPE: Command check_traffic_server_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:52:41] PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:43] PROBLEM - Webrequests Varnishkafka log producer on cp3054 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:52:49] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:52:53] I need help with https://phabricator.wikimedia.org/T235677, as I (and the Design team) am blocked on it.
[19:53:11] PROBLEM - ats-tls HTTPS en.wikipedia.org ECDSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:53:11] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:53:15] PROBLEM - HTTPS Unified ECDSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:53:25] PROBLEM - Ensure traffic_manager is running for instance tls on cp3054 is CRITICAL: NRPE: Command check_traffic_manager_tls not defined https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:53:29] PROBLEM - HTTPS Unified RSA on cp3054 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [19:53:31] PROBLEM - statsv Varnishkafka log producer on cp3054 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [19:54:32] ACKNOWLEDGEMENT - IPMI Sensor Status on bast3004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [19:54:44] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3054 is CRITICAL: connect to address 10.20.0.54 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [19:55:24] scheduled 1h downtime for services on cp3054 [19:56:28] RECOVERY - traffic-pool service on cp3054 is OK: OK - traffic-pool is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:57:30] !log cp3054 - trying racadm serveraction hardreset [19:57:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:21] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) [19:59:22] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Dzahn) [19:59:24] (03CR) 10MarcoAurelio: "This step (uploading a patch against this repo) doesn't appear to be documented at https://wikitech.wikimedia.org/wiki/Add_a_wiki. We shou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac) [19:59:27] 10Operations, 10ops-esams, 10DNS, 10Traffic: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (10Dzahn) [20:00:17] (03CR) 10BPirkle: [C: 03+1] Config changes for Echo kask migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540731 (https://phabricator.wikimedia.org/T222851) (owner: 10Catrope) [20:01:14] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn) 05Open→03Stalled These need to stay on role(spare) for now, per bblack "we haven't figured out the edge DC ganeti cluster configs yet" [20:01:24] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:01:28] RECOVERY - ats-tls HTTPS en.wikipedia.org ECDSA on cp3054 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 599712 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:01:28] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp3054 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 546072 seconds left:Certificate *.wikipedia.org contains all required 
SANs:Certificate *.wikipedia.org (RSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:01:28] RECOVERY - HTTPS Unified ECDSA on cp3054 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 599711 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-10-06 12:00:00 +0000 (expires in 347 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:02:03] 10Operations, 10ops-esams, 10DC-Ops: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790 (10Dzahn) 05Stalled→03Open [20:02:05] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Dzahn) [20:02:40] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:02:45] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10Dzahn) [20:02:46] RECOVERY - Webrequests Varnishkafka log producer on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:03:26] RECOVERY - Ensure traffic_manager is running for instance tls on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --run-root=/srv/trafficserver/tls --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:03:36] RECOVERY - eventlogging Varnishkafka log producer on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:03:52] RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:52] RECOVERY - Ensure 
traffic_server is running for instance tls on cp3054 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 8443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:03:52] RECOVERY - Ensure trafficserver_exporter is running for instance tls on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint https://127.0.0.1:8443/_stats --port 9322 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:04:02] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:04:05] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3053.esams.wmnet', 'cp3050.esams.wmnet', 'cp3051.esams.wmnet', 'cp3052.esams.wmnet', 'cp3054.esams.wmnet'] ` and were **A... [20:04:34] !log reboot cp3054 again for good measure [20:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:30] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3054 is OK: HTTP OK: HTTP/1.1 200 OK - 540 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/Varnish [20:07:24] ACKNOWLEDGEMENT - Juniper alarms on csw2-esams is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms Ayounsi known, only 1 power feed until tomorrow - The acknowledgement expires at: 2019-10-25 12:06:55. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [20:09:57] RECOVERY - statsv Varnishkafka log producer on cp3054 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [20:11:59] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) ^ made active bastion host with the global firewall change above created wikitech pages https://wikitech.wikimedia.org/wiki/Bast3004 https://wikitech.wikimedia.org/w... [20:15:05] 10Operations, 10ops-esams, 10DC-Ops, 10Patch-For-Review: ESAMS Refresh/Rebuild (October 2019) - https://phabricator.wikimedia.org/T235805 (10Papaul) [20:17:47] (03CR) 10Dzahn: [C: 03+2] smokeping: replace bast3002 with bast3004 as target [puppet] - 10https://gerrit.wikimedia.org/r/545921 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [20:19:10] (03CR) 10Ayounsi: [C: 03+1] "LGTM." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/544881 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:21:23] 10Operations, 10ops-esams, 10Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (10Papaul) [20:22:41] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3054.esams.wmnet [20:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:52] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Papaul) [20:23:00] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3053.esams.wmnet [20:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:10] 10Operations, 10ops-esams, 10Traffic: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (10Papaul) [20:24:13] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3033.esams.wmnet [20:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:31] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3038.esams.wmnet [20:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:45] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) a:05Papaul→03Dzahn [20:25:04] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (10Dzahn) service "bastion host" is ready but service "tftp" still needs to be migrated. taking it. 
[20:29:15] (03PS1) 10CDanis: wikimedia.org: add performance-graphite [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) [20:30:45] (03PS2) 10CDanis: wikimedia.org: add performance-graphite [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) [20:33:12] (03CR) 10Bstorm: "For reference, the puppet compiler bit: https://puppet-compiler.wmflabs.org/compiler1001/19052/" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [20:33:38] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [20:33:50] !log restbase cassandra rolling restart, codfw / rack 'c' -- T200803 [20:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:55] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [20:36:40] !log esams lvs: high-traffic1 - change 3003's med to 200, 3001's med to 50, 3005 remains 100 (traffic will blip to 3005 then back to 3001 again) [20:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:56] 10Operations, 10ops-eqiad: rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10RobH) p:05Triage→03Normal [20:40:07] 10Operations, 10ops-eqiad: rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10RobH) [20:40:45] !log esams lvs: high-traffic1 - change 3005's med to 0 (becomes new primary, permanently) [20:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:16] 10Operations, 10ops-eqiad: rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10RobH) a:03Joe IRC update: Chatted with @joe and I'm assigning this to him first so he can evaluate where he wants to have all of these 
incoming mw hosts racked. He'll update this task, and pl... [20:42:12] (03PS1) 10Dzahn: Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 [20:43:57] (03PS2) 10Dzahn: Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 [20:44:39] (03PS1) 10BBlack: esams text lvs: 5 is primary, [13] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545936 [20:45:25] (03CR) 10Dzahn: [C: 03+2] "need something like this again for migrating another bastion. first restoring and then moving to profile and adjusting IPs" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (owner: 10Dzahn) [20:45:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (owner: 10Dzahn) [20:46:09] (03PS3) 10Dzahn: Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) [20:46:11] (03CR) 10BBlack: [C: 03+2] esams text lvs: 5 is primary, [13] are backups [puppet] - 10https://gerrit.wikimedia.org/r/545936 (owner: 10BBlack) [20:48:03] (03CR) 10jerkins-bot: [V: 04-1] Revert "delete unused bastionhost::migration class" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [20:52:29] (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/545550 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [20:58:10] !log cr2-esams switch high-traffic1 static fallback routes from lvs3001 to lvs3005 [20:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:19] !log cr3-esams switch high-traffic1 static fallback routes from lvs3001 to lvs3005 [20:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:39] !log downtimed lvs3001-4, stopping pybal there, etc... 
[21:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:35] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10ops-monitoring-bot) [21:05:46] !log restbase cassandra rolling restart, codfw / rack 'd' -- T200803 [21:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:53] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [21:05:54] !log cr2-esams remove pybal neighbor IPs for lvs3001-4 [21:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:17] !log cr3-esams remove pybal neighbor IPs for lvs3001-4 [21:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:35] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10Bstorm) ` Enclosure Device ID: 32 Slot Number: 4 Enclosure position: 1 Device Id: 4 WWN: 55cd2e414dae9475 Sequence Number: 4 Media Error Count: 0 Other Error Count: 98287 Predictive Failure Count: 0 Last Pr... 
[21:10:32] (03PS4) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) [21:10:34] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10Bstorm) [21:10:37] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (10Bstorm) [21:12:02] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236331 (10bd808) [21:12:04] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T236438 (10bd808) [21:12:31] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3039.esams.wmnet [21:12:34] PROBLEM - IPMI Sensor Status on dns3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:35] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [21:13:00] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp3044.esams.wmnet [21:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:11] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3051.esams.wmnet [21:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:01] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp3050.esams.wmnet [21:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:12] !log bblack@cumin1001 conftool action : set/pooled=no; selector: 
name=cp3040.esams.wmnet [21:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:12] PROBLEM - IPMI Sensor Status on cp3054 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [21:20:20] (03PS1) 10Subramanya Sastry: Direct Parsoid/PHP logs to a parsoid-php logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) [21:22:58] (03PS1) 10BBlack: lvs3001-4: unconfigure and move to spare [puppet] - 10https://gerrit.wikimedia.org/r/545946 [21:24:42] (03CR) 10Subramanya Sastry: "In a few months time, once Parsoid/PHP logging has been fine-tuned and we are well past the deployment, we can revert this change if we de" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545944 (https://phabricator.wikimedia.org/T235899) (owner: 10Subramanya Sastry) [21:27:39] (03CR) 10BBlack: [C: 03+2] lvs3001-4: unconfigure and move to spare [puppet] - 10https://gerrit.wikimedia.org/r/545946 (owner: 10BBlack) [21:31:13] (03PS5) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) [21:33:08] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [21:44:44] (03PS1) 10Dzahn: mediawiki::webserver: disable hhvm-needs-restart cron [puppet] - 10https://gerrit.wikimedia.org/r/545950 (https://phabricator.wikimedia.org/T229792) [21:46:43] 10Operations, 10Traffic, 10CPT Initiatives (Core REST API in PHP), 10Core Platform Team Workboards (Green), 10Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (10WDoranWMF) @BBlack @ema would you 
have any time to review Petr's patch above? [21:49:59] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19053/mw1343.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/545950 (https://phabricator.wikimedia.org/T229792) (owner: 10Dzahn) [21:56:51] 10Operations, 10serviceops, 10HHVM, 10MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10Dzahn) merged the above because we were getting cron spam from appservers with "/usr/local/bin/hhvm-needs-restart: not found" [21:59:53] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10Volker_E) We've reverted Git LFS for now in https://github.com/wikimedi... [22:00:23] RECOVERY - Check systemd state on mw1270 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:01:05] !log mw1270 - was alerting in Icinga as degraded systemd state - reason was 'hhvm.service not-found'. systemctl reset-failed cleared it. 
could cause monitoring spam on more servers (T229792) [22:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:10] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 [22:09:02] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [22:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:39] !log stopping gerrit briefly for script run for T236344 [22:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:44] T236344: Some All-Users.git references are outdated after gerrit1001 migration - https://phabricator.wikimedia.org/T236344 [22:11:07] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:12] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [22:12:13] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [22:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:15] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:19] gerrit's down, is that known? [22:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:05] MatmaRex: i dont think so, no. logging in [22:13:08] thcipriani: ^ [22:13:16] MatmaRex, looks like thcipriani stopped it above for running a script [22:13:16] "Gerrit is down. We're working on bringing it back as soon as possible. [22:13:17] Please follow along the discussion at #wikimedia-operations on freenode as we debug. [22:13:17] Please try again later!" 
[22:13:29] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [22:13:30] PROBLEM - gerrit process on gerrit1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [22:13:32] PROBLEM - Check systemd state on deploy1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:50] PROBLEM - Check systemd state on contint1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:51] !log gerrit1001 - starting gerrit [22:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:16] mutante, looks like th.cipriani stopped it for T236344? [22:14:19] mutante: MatmaRex I logged it [22:14:19] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:32] was just about to start it again [22:14:34] PROBLEM - Check the last execution of git_pull_charts on contint1001 is CRITICAL: CRITICAL: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:14:44] RECOVERY - gerrit process on gerrit1001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [22:15:03] thcipriani: i hope starting it was ok ... 
[22:15:04] RECOVERY - Check systemd state on contint1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:46] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:50] sorry, i also missed the log line somehow [22:16:25] mutante: should be fine. I was being paranoid. Preventing a very unlikely race condition. I ran the script right after stopping it, was checking some things when I noticed it was already started. I think should be ok :) [22:16:26] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:38] thcipriani: phew, ok! [22:18:24] (03PS1) 10Bstorm: maintain-kubeusers: add ability to merge and update configs [puppet] - 10https://gerrit.wikimedia.org/r/545966 (https://phabricator.wikimedia.org/T236202) [22:19:58] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T230575 (10wiki_willy) a:05Bstorm→03Jclark-ctr Talked to our Dell rep on this one, who can reach out to the Dell tech support rep directly, after we re-open the ticket. He basically confirmed the same thing @Bsto... 
[22:20:07] (03PS6) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) [22:20:43] ACKNOWLEDGEMENT - IPMI Sensor Status on dns3001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] daniel_zahn esams buildout WIP https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [22:21:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) Pointed this task out to our Dell account rep today. @Jclark-ctr - let me know if the steps they provided don... [22:24:45] RECOVERY - Check the last execution of git_pull_charts on contint1001 is OK: OK: Status of the systemd unit git_pull_charts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:31:39] (03PS7) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) [22:33:31] (03CR) 10Dzahn: [C: 03+2] Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [22:33:36] (03PS4) 10Dzahn: Set DNS configuration for ka.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [22:37:50] (03CR) 10Dzahn: "@RLazarus: perfect example for a small change to Apache config that needs testing / deployment." 
[puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [22:44:46] (03PS8) 10Dzahn: bastionhost: recreate migration class, convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) [22:44:53] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/19055/" [puppet] - 10https://gerrit.wikimedia.org/r/545935 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [22:47:37] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10wiki_willy) a:03RobH @RobH - can you check if the configuration on this one is complete? It was one of the PDUs you and Chris upgraded, when you went out to eqiad. Thanks, Willy [22:56:39] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10RobH) 05Open→03Resolved >>! In T227143#5604899, @wiki_willy wrote: > @RobH - can you check if the configuration on this one is complete? It was one of the PDUs you and Chris upgraded, when you went o... [22:56:42] 10Operations, 10ops-eqiad, 10DC-Ops: Install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191024T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:00:14] (03PS1) 10Dzahn: bastionhost: fix migration class fqdn comparison, rename vars [puppet] - 10https://gerrit.wikimedia.org/r/545971 (https://phabricator.wikimedia.org/T236394) [23:02:19] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (10mmodell) So I think this should really be deployed with scap. That woul... [23:03:21] (03PS1) 10Dzahn: DHCP: replace bast3002 with bast3004 as next-server [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394) [23:04:29] mutante: would you be able to help with https://phabricator.wikimedia.org/T235677 once more? [23:05:01] (03PS2) 10Dzahn: DHCP: replace bast3002 with bast3004 as next-server [puppet] - 10https://gerrit.wikimedia.org/r/545973 (https://phabricator.wikimedia.org/T236394) [23:06:28] Volker_E: kind of busy right now. i think the latest comment by Mukunda is the way to go though. [23:06:42] did you need a quick fix or something or in general [23:06:56] very quick fix [23:06:59] mutante: not sure where to find the puppet logs for Microsites [23:07:00] we're blocked [23:07:08] we=Design team [23:07:26] twentyafterfour: what do you want to know from puppet log? 
[23:07:27] mutante: I can catch up with you later about it though if you are busy [23:07:50] the error was already on the ticket, wasnt it [23:07:51] mutante: well, previously the git::clone had errors and we've force-pushed and update to gerrit so that it won't use git-lfs [23:07:59] but the change didn't show up apparently [23:08:25] ok, running puppet to see [23:08:41] I'm going to write a patch to switch it to scap, but I'll need review, will ping you about that later if it's ok [23:09:04] Error: '/usr/bin/git pull --quiet' returned 128 instead of one of [0] [23:09:50] mutante: thanks, go back to what you were doing I'll try to help Volker_E [23:09:56] :) [23:09:56] (/Stage[main]/Profile::Microsites::Design/Git::Clone[design/style-guide]/Exec[git_pull_design/style-guide]/returns) fatal: refusing to merge unrelated histories [23:09:59] this says more [23:10:08] unrelated histories.. [23:10:16] that's kind of what you'd expect if it was force-pushed [23:10:16] ok! [23:10:49] (03CR) 10Jon Harald Søby: "This might be too late, but our naming conventions for country-based community groups (and chapters) is to use countrycode.wikimedia.org. 
" [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [23:10:58] that is, if your remote is completely different than your local copy and you pull quietly, git quietly refuses to act :) [23:11:52] (03PS2) 10Cwhite: admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) [23:12:09] (03CR) 10jerkins-bot: [V: 04-1] admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [23:12:24] 10Operations, 10DBA, 10Data-Services: Prepare and check storage layer for ka.wikimedia.org - https://phabricator.wikimedia.org/T236404 (10colewhite) p:05Triage→03Normal [23:13:07] thcipriani: right but shouldn't it be cloning fresh? [23:13:17] IIRC this has something to do with the + in the refspec [23:13:19] 10Operations, 10Wikimedia-production-error: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (10colewhite) p:05Triage→03Normal [23:13:26] ahh I see [23:13:35] ensure => latest causes it to pull [23:13:56] 10Operations, 10observability: Tune HTTP availability alerts - https://phabricator.wikimedia.org/T236367 (10colewhite) p:05Triage→03Normal [23:14:12] yes. we can delete everything and then run puppet [23:14:24] mutante: that is probably the easiest fix [23:14:34] but they need to stop making changes that require a force-push ;) [23:14:38] but that could mean pages get cached while there is no content [23:14:50] mutante: yeah [23:14:54] > Since Git version 2.20, fetching to update refs/tags/* works the same way as when pushing. I.e. 
any updates will be rejected without + in the refspec (or --force) [23:14:56] and i wont be able to start purging them right now [23:15:17] so adding a + to the refspec may fix puppet [23:15:19] thcipriani: yeah I think ensure=>latest should probably add --force [23:15:28] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite) [23:15:43] I don't know if that's the least surprising thing it could do [23:15:44] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite) p:05Triage→03Normal a:03colewhite [23:16:25] the consequence of adding --force could be worse than a bad puppet run [23:16:28] (03CR) 10Dzahn: [C: 03+2] bastionhost: fix migration class fqdn comparison, rename vars [puppet] - 10https://gerrit.wikimedia.org/r/545971 (https://phabricator.wikimedia.org/T236394) (owner: 10Dzahn) [23:17:09] thcipriani: well, wouldn't that grab whatever is on gerrit, which matches my expectation for what is "latest" [23:17:30] in general it seems like a bad idea to have puppet ensuring latest [23:17:36] ^ [23:17:53] that's why we usually have 2 repos, normal and deployment-repo [23:17:58] right [23:18:17] scap is for sure the solution but trying to quickly get things unblocked [23:18:30] I'll write up the scap patch [23:18:48] * Krinkle staging on mwdebug1001 [23:19:02] ~(a puppet patch to switch it to deploy via scap) [23:19:16] twentyafterfour: OK to use scap now for a MW fix? 
[23:19:28] Krinkle: yes I think so [23:20:27] Tchanders: staged on mwdebug1001 [23:20:40] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite) [23:21:15] Krinkle Thanks, checking [23:21:34] (03PS1) 10Cwhite: admin: add cglenn to researchers and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321) [23:22:41] Krinkle All good [23:23:30] (03PS3) 10Cwhite: admin: add Kevin Bazira to several groups [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) [23:24:12] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.3/includes/specials/pagers/BlockListPager.php: T236425, fc99c5a7c0de2 (duration: 00m 54s) [23:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:18] T236425: Fatal Error: "Call to a member function getId() on string" from BlockListPager.php - https://phabricator.wikimedia.org/T236425 [23:25:45] (03CR) 10Jon Harald Søby: [C: 04-1] "Should be ge.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [23:26:23] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/545418 (https://phabricator.wikimedia.org/T234209) (owner: 10Cwhite) [23:27:40] (03CR) 10Jon Harald Søby: [C: 04-1] "Should be gewikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/545909 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [23:30:03] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10colewhite) Hi CherRaye! I added the checklist and need a couple things in order to proceed: an SSH public key and sign off... 
[23:30:49] (03PS1) 10Dzahn: bastionhost::migration: comment out rsync server, duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545976 [23:31:36] (03PS2) 10Dzahn: bastionhost::migration: comment out rsync server, duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545976 [23:31:41] (03CR) 10Dzahn: [C: 03+2] bastionhost::migration: comment out rsync server, duplicate [puppet] - 10https://gerrit.wikimedia.org/r/545976 (owner: 10Dzahn) [23:33:09] (03PS1) 10Dzahn: Revert "Set DNS configuration for ka.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/545977 [23:33:57] (03CR) 10Dzahn: "> Patch Set 4:" [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [23:35:26] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 8 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Krinkle) [23:35:33] (03CR) 10Jon Harald Søby: [C: 03+1] "Thanks. :-)" [dns] - 10https://gerrit.wikimedia.org/r/545977 (owner: 10Dzahn) [23:36:13] (03CR) 10Dzahn: [C: 03+2] Revert "Set DNS configuration for ka.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/545977 (owner: 10Dzahn) [23:37:47] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:39:32] (03CR) 10Dzahn: [C: 04-1] "ack. it should use "ge" instead." [puppet] - 10https://gerrit.wikimedia.org/r/545889 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [23:41:27] (03CR) 10Dzahn: RESTRouter: Add ka.wm.org (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac) [23:41:34] (03CR) 10Dzahn: ""ka" was actually a wrong request. it should use "ge" instead." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/545916 (https://phabricator.wikimedia.org/T236389) (owner: 10Mobrovac) [23:43:19] (03PS1) 10Dzahn: add ge.wikimedia.org for Georgia chapter [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) [23:46:41] !log bast3002 - rsyncing /home, /srv/tfptboot and /srv/prometheus to /srv/bast3002/ on bast3004 (T236394 T236329) [23:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:51] T236329: decommission bast3002 - https://phabricator.wikimedia.org/T236329 [23:46:52] T236394: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 [23:49:49] 10Operations, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission bast3002 - https://phabricator.wikimedia.org/T236329 (10Dzahn) >>! In T236329#5601691, @fgiunchedi wrote: > Also a note for when the time comes: there's Prometheus data on this host that will need to be migrated onto a VM on esams' ga... [23:50:03] (03CR) 10Krinkle: "Should this be under *.wikimedia.org? For JS/HTML security (CORS) might make sense to place elsewhere just in case. E.g. 
wmfusercontent.or" [dns] - 10https://gerrit.wikimedia.org/r/545934 (https://phabricator.wikimedia.org/T231870) (owner: 10CDanis) [23:53:38] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/dns/+/545979" [dns] - 10https://gerrit.wikimedia.org/r/545977 (owner: 10Dzahn) [23:54:20] (03CR) 10Dzahn: "reverted and made https://gerrit.wikimedia.org/r/c/operations/dns/+/545979" [dns] - 10https://gerrit.wikimedia.org/r/545888 (https://phabricator.wikimedia.org/T236389) (owner: 10MarcoAurelio) [23:56:27] PROBLEM - snapshot of s4 in eqiad on db1115 is CRITICAL: snapshot for s4 at eqiad taken more than 4 days ago: Most recent backup 2019-10-20 23:28:32 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [23:56:53] (03CR) 10Jon Harald Søby: [C: 03+1] add ge.wikimedia.org for Georgia chapter [dns] - 10https://gerrit.wikimedia.org/r/545979 (https://phabricator.wikimedia.org/T236389) (owner: 10Dzahn) [23:58:23] (03PS1) 10Milimetric: Sync geoeditors data to dumps and add links [puppet] - 10https://gerrit.wikimedia.org/r/545982 (https://phabricator.wikimedia.org/T131280)