[02:57:14] !log volker-e@deploy1001 Started deploy [design/style-guide@cebc152]: Deploy design/style-guide:
[02:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:21] !log volker-e@deploy1001 Finished deploy [design/style-guide@cebc152]: Deploy design/style-guide: (duration: 00m 07s)
[02:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:07] !log volker-e@deploy1001 Started deploy [design/style-guide@8bec25e]: Deploy design/style-guide:
[04:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:14] !log volker-e@deploy1001 Finished deploy [design/style-guide@8bec25e]: Deploy design/style-guide: (duration: 00m 07s)
[04:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:15] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 141 probes of 591 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[05:35:29] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 591 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[06:19:11] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 51284592 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:20:59] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 26576 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:29:29] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:36] PROBLEM - DPKG on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:29:47] PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:29:51] PROBLEM - MD RAID on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:29:53] PROBLEM - Check size of conntrack table on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:29:55] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:29:59] PROBLEM - dhclient process on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:30:09] PROBLEM - Check systemd state on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:13] PROBLEM - Check whether ferm is active by checking the default input chain on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:30:13] PROBLEM - DPKG on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:30:19] PROBLEM - configured eth on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:30:21] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:30:21] PROBLEM - dhclient process on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:30:23] PROBLEM - Disk space on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops
[06:30:23] PROBLEM - ores uWSGI web app on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:30:23] PROBLEM - configured eth on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:30:25] PROBLEM - Disk space on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops
[06:30:53] PROBLEM - MD RAID on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:31:01] PROBLEM - puppet last run on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:31:41] PROBLEM - puppet last run on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:41:55] PROBLEM - ores_workers_running on ores1001 is CRITICAL: connect to address 10.64.0.51 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES
[06:41:55] PROBLEM - ores_workers_running on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES
[06:41:59] PROBLEM - Check the NTP synchronisation status of timesyncd on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[06:42:49] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:51] RECOVERY - Check whether ferm is active by checking the default input chain on ores1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:42:53] RECOVERY - DPKG on ores1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:42:59] RECOVERY - configured eth on ores1001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:43:01] RECOVERY - dhclient process on ores1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:43:03] RECOVERY - Disk space on ores1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1001&var-datasource=eqiad+prometheus/ops
[06:43:31] RECOVERY - MD RAID on ores1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:43:43] RECOVERY - ores_workers_running on ores1001 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[06:44:21] RECOVERY - Check size of conntrack table on ores1001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:46:37] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:46:37] RECOVERY - Disk space on ores1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops
[06:46:39] RECOVERY - configured eth on ores1005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:47:21] RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES
[06:47:35] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:47:41] RECOVERY - DPKG on ores1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:47:53] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:47:57] RECOVERY - MD RAID on ores1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:48:03] RECOVERY - dhclient process on ores1005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:48:25] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:49:07] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:12:47] RECOVERY - Check the NTP synchronisation status of timesyncd on ores1005 is OK: OK: synced at Sun 2020-01-12 07:12:46 UTC. https://wikitech.wikimedia.org/wiki/NTP
[09:24:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[09:25:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[14:45:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:46:39] !log restart php on mw1238
[14:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:31] !log restart php on mw1240
[14:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:10:01] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 113464672 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:11:51] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 9552 and 101 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:55:18] 10Operations, 10Research, 10Traffic: Set up git-driven static microsite for wikiworkshop.org - https://phabricator.wikimedia.org/T242374 (10bmansurov) Status update: I've [[ https://www.mediawiki.org/w/index.php?title=Gerrit/New_repositories/Requests/Entries&diff=prev&oldid=3609873&diffmode=source | requeste...
[22:16:15] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:18:03] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:51:16] PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100%
[22:53:51] PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100%
[23:19:37] two in a row?
[23:47:49] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:53:15] 10Operations, 10Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Krenair) `Jan 12 22:51:15 PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100% Jan 12 22:53:51 PROBLEM - Host cp3061 is DOWN: PING CRITICAL - Packet loss = 100%...