[00:23:59] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:25:21] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 4.380 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[00:42:45] PROBLEM - HHVM rendering on mw1221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:03] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 76098 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:44:35] PROBLEM - Nginx local proxy to videoscaler on mw1310 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[00:44:41] PROBLEM - PHP7 rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:45:55] RECOVERY - Nginx local proxy to videoscaler on mw1310 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.028 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[00:45:59] RECOVERY - PHP7 rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:00:17] PROBLEM - Nginx local proxy to jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[01:01:33] RECOVERY - Nginx local proxy to jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[01:53:25] PROBLEM - PHP7 rendering on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:54:43] RECOVERY - PHP7 rendering on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:58:26] (03CR) 10Ladsgroup: "ping :)" [puppet] - 10https://gerrit.wikimedia.org/r/511078 (https://phabricator.wikimedia.org/T113114) (owner: 10Ladsgroup)
[02:22:29] PROBLEM - PHP7 rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:23:57] RECOVERY - PHP7 rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 9.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:42:43] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:44:07] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 5.869 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:46:17] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:47:35] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.017 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[02:50:41] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner
[02:51:59] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 271 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner
[03:34:39] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[04:01:47] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:37:19] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:46:07] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational
[06:31:17] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:41] PROBLEM - puppet last run on theemin is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:58:29] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:53] RECOVERY - puppet last run on theemin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:03:13] (03PS1) 10Elukey: role::analytics_cluster::coordinator: remove port druid host [puppet] - 10https://gerrit.wikimedia.org/r/517203
[08:05:09] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: remove port druid host [puppet] - 10https://gerrit.wikimedia.org/r/517203 (owner: 10Elukey)
[08:11:53] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[08:13:27] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:13:39] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:13:39] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:13:51] this is probably me
[08:13:59] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:13:59] checking
[08:13:59] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:14:31] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:17:23] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:17:45] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:19:07] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:19:08] yep this is definitely me, fixing a maintenance job for druid caused a problem, and aqs -> druid now is not happy
[08:19:53] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:20:45] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:21:57] !log roll restart of druid brokers on druid100[4-6], stuck after regular data drop maintenance
[08:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:03] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:22:19] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:22:23] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:22:35] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:22:41] what the hell
[08:23:03] the data drop job shouldn't cause this mess :(
[08:23:33] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:24:01] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:24:33] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[08:24:37] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:25:21] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[08:26:11] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:08:59] PROBLEM - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops
[09:25:41] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[09:30:01] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[10:19:43] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:21:05] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:23:08] 04Critical Alert for device cr1-eqsin.wikimedia.org - Device took too long to poll
[10:25:09] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqsin.wikimedia.org recovered from Device took too long to poll
[10:56:27] PROBLEM - EDAC syslog messages on wtp2020 is CRITICAL: 4 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops
[11:15:49] PROBLEM - EDAC syslog messages on db2084 is CRITICAL: 4.63 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db2084&var-datasource=codfw+prometheus/ops
[12:22:15] 10Operations, 10DBA: db2084 temporary correctable hardware errors - https://phabricator.wikimedia.org/T225884 (10Marostegui)
[12:22:33] ACKNOWLEDGEMENT - EDAC syslog messages on db2084 is CRITICAL: 4.5 ge 4 Marostegui T225884 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db2084&var-datasource=codfw+prometheus/ops
[13:36:35] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[13:40:46] Hmm
[14:03:51] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:20:43] !log running mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user='AKA MBG' /home/urbanecm/T225886
[14:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:45] 10Operations, 10ops-codfw: wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10jijiki) New alarms going off for this one ` [Sun Jun 16 08:30:29 2019] mce: [Hardware Error]: Machine check events logged [Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [Sun Jun...
[14:22:38] ACKNOWLEDGEMENT - EDAC syslog messages on wtp2020 is CRITICAL: 4 ge 4 Effie Mouzeli Task already open for these errors - T205712 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops
[14:22:38] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 4.001 ge 4 Effie Mouzeli Task already open for these errors - T205712 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops
[14:38:24] 10Operations, 10Traffic: ATS is currently adding its own server header - https://phabricator.wikimedia.org/T224119 (10Antigng) Also, ATS doesn't change the via header as Varnish does.{F29584602}
[15:04:47] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[15:21:13] PROBLEM - Device not healthy -SMART- on db2043 is CRITICAL: cluster=mysql device=cciss,2 instance=db2043:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2043&var-datasource=codfw+prometheus/ops
[15:46:36] ACKNOWLEDGEMENT - HP RAID on db2043 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:3 - OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T225889 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:46:39] 10Operations, 10ops-codfw: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10ops-monitoring-bot)
[15:49:38] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2043 is CRITICAL: cluster=mysql device=cciss,2 instance=db2043:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2043&var-datasource=codfw+prometheus/ops
[15:50:24] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10Marostegui) a:03Papaul @Papaul can we get the disk replaced? Thanks!
[15:50:34] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2043 - https://phabricator.wikimedia.org/T225889 (10Marostegui) p:05Triage→03Normal
[15:50:49] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui)
[15:59:00] RECOVERY - ensure kvm processes are running on cloudvirt1015 is OK: PROCS OK: 7 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:06:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) I've put eight test VMs on 1015, will let them run for a few days and then see if they're still up :)
[16:22:37] (03Abandoned) 10Ori.livneh: Configure forensic logging of Apache requests; enable on beta [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh)
[19:20:51] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) >>! In T204056#5260214, @CRoslof wrote: > Domain names with [[ https://en.wikipedia.org/wiki/Country_code_top-level_domain | country code top-level doma...
[20:56:32] PROBLEM - Device not healthy -SMART- on db2058 is CRITICAL: cluster=mysql device=cciss,3 instance=db2058:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2058&var-datasource=codfw+prometheus/ops
[22:09:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:10:00] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:10:58] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:11:58] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:12:24] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 414, down: 0, shutdown: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:12:52] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:47:58] PROBLEM - Disk space on dbprov1001 is CRITICAL: DISK CRITICAL - free space: /srv 452151 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[23:13:04] PROBLEM - HP RAID on db2058 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:13:06] ACKNOWLEDGEMENT - HP RAID on db2058 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T225902 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:13:10] 10Operations, 10ops-codfw: Degraded RAID on db2058 - https://phabricator.wikimedia.org/T225902 (10ops-monitoring-bot)
[23:28:24] PROBLEM - Disk space on dbprov1001 is CRITICAL: DISK CRITICAL - free space: /srv 454674 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[23:50:06] RECOVERY - Disk space on dbprov1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space