[00:16:40] (PS1) Jeena Huneidi: Set correct appbase url port [deployment-charts] - https://gerrit.wikimedia.org/r/539641
[00:18:34] (PS2) Jeena Huneidi: Set correct appbase url port [deployment-charts] - https://gerrit.wikimedia.org/r/539641
[00:26:31] Operations, Wikimedia-Mailing-lists, Wikispore: Creation of Wikispore mailing list - https://phabricator.wikimedia.org/T232961 (Peachey88)
[01:54:09] (PS5) Krinkle: Revert "Disable MessageBlobStore::clear() via hook" [mediawiki-config] - https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) (owner: Catrope)
[01:54:27] (CR) Krinkle: [C: +1] "Yep." [mediawiki-config] - https://gerrit.wikimedia.org/r/508476 (https://phabricator.wikimedia.org/T222539) (owner: Catrope)
[02:40:26] (CR) SBassett: [C: -1] "Needs manual rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/519479 (https://phabricator.wikimedia.org/T226733) (owner: SBassett)
[03:23:17] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[03:24:49] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[04:16:03] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:11] PROBLEM - Device not healthy -SMART- on db1070 is CRITICAL: cluster=mysql device=megaraid,7 instance=db1070:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1070&var-datasource=eqiad+prometheus/ops
[04:53:40] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Vgutierrez) >>! In T210411#5531164, @Dzahn wrote: >>>! In T210411#5496180, @Vgutierrez wrote: >> Please note that the docker-registry certificate is missing the public hos...
[04:56:45] RECOVERY - Memory correctable errors -EDAC- on elastic1029 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad+prometheus/ops
[05:00:03] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1070 is CRITICAL: cluster=mysql device=megaraid,7 instance=db1070:9100 job=node site=eqiad Marostegui T208323 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1070&var-datasource=eqiad+prometheus/ops
[05:00:58] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:01:05] Operations, ops-eqiad, DBA: db1070 (s5 master) SMART-reported impending drive failure - https://phabricator.wikimedia.org/T234115 (CDanis)
[05:02:43] Operations, ops-eqiad, DBA: db1070 (s5 master) SMART-reported impending drive failure - https://phabricator.wikimedia.org/T234115 (CDanis)
[05:02:47] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (CDanis)
[09:58:25] PROBLEM - Check systemd state on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:37] PROBLEM - MD RAID on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:59:01] PROBLEM - DPKG on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:59:29] PROBLEM - Check whether ferm is active by checking the default input chain on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:59:53] PROBLEM - puppet last run on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:00:07] PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:00:15] PROBLEM - Check size of conntrack table on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:00:33] PROBLEM - Disk space on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[10:01:11] PROBLEM - dhclient process on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[10:01:19] PROBLEM - configured eth on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[10:07:59] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:11:49] RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[10:11:55] RECOVERY - DPKG on stat1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[10:12:23] RECOVERY - Check whether ferm is active by checking the default input chain on stat1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:12:29] RECOVERY - dhclient process on stat1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[10:12:37] RECOVERY - configured eth on stat1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[10:12:57] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:13:09] RECOVERY - Check size of conntrack table on stat1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[10:13:09] RECOVERY - MD RAID on stat1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:16:41] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:17:35] (PS1) Urbanecm: New throttle rule for Czech wiki course [mediawiki-config] - https://gerrit.wikimedia.org/r/539661 (https://phabricator.wikimedia.org/T234113)
[11:44:55] Operations, Traffic: Unable to connect to Wikimedia sites from Iran? - https://phabricator.wikimedia.org/T234123 (MarcoAurelio)
[12:03:03] Operations, Traffic, netops: Unable to connect to Wikimedia sites from Iran? - https://phabricator.wikimedia.org/T234123 (MarcoAurelio)
[12:03:49] PROBLEM - configured eth on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[12:04:11] PROBLEM - Check systemd state on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:04:19] PROBLEM - Check size of conntrack table on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[12:04:25] PROBLEM - MD RAID on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:04:35] PROBLEM - Disk space on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[12:04:47] PROBLEM - DPKG on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[12:05:09] PROBLEM - Check whether ferm is active by checking the default input chain on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:05:17] PROBLEM - dhclient process on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[12:06:17] PROBLEM - puppet last run on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[12:10:30] Operations, Traffic, netops: Unable to connect to Wikimedia sites from Iran? - https://phabricator.wikimedia.org/T234123 (Arian_Ar) Starting from 10:30 UTC, @MohammadtheEditor , @Mardetanha and i received disruption reports from some of our users in Iran. It Varies from ISP to ISP, but more ISPs are...
[12:20:35] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[12:29:49] PROBLEM - SSH on stat1004 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:33:05] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:35:11] Operations, Traffic, netops: Unable to connect to Wikimedia sites from Iran? - https://phabricator.wikimedia.org/T234123 (MohammadtheEditor) p:Triage→Low Here is what we know about the issue: 1. The issue has been mainly around DSL ISPs, amongst three major mobile Internet providers in Iran,...
[12:41:47] PROBLEM - IPMI Sensor Status on stat1004 is CRITICAL: connect to address 10.64.5.104 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[12:52:29] PROBLEM - SSH on stat1004 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:54:11] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:46:51] mmm no metrics from stat1004 in the past hour https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&refresh=5m&orgId=1
[13:49:03] trying to login via mgmt serial
[13:49:16] but it is really unusable
[13:54:35] RECOVERY - Check whether ferm is active by checking the default input chain on stat1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:54:41] RECOVERY - dhclient process on stat1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[13:54:51] RECOVERY - configured eth on stat1004 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[13:55:13] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:55:19] RECOVERY - Check size of conntrack table on stat1004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[13:55:27] RECOVERY - MD RAID on stat1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[13:55:37] RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops
[13:55:44] killed a huge python script :)
[13:55:49] RECOVERY - DPKG on stat1004 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:58:21] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[14:13:39] RECOVERY - IPMI Sensor Status on stat1004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:19:07] elukey: huge snake ate the server? :(
[14:19:21] poor stat1004
[14:22:59] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1004 is OK: OK: synced at Sat 2019-09-28 14:22:58 UTC. https://wikitech.wikimedia.org/wiki/NTP
[15:44:54] vgutierrez: yeah exactly :D
[16:26:01] PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.status is 7246 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[16:26:17] uh?
[16:26:19] * vgutierrez checking
[16:26:51] PROBLEM - Ensure cert-sync script runs successfully in the active node on acmechief1001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.done is 7294 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[16:28:49] !log restarting acme-chief on acmechief1001
[16:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:03] RECOVERY - Ensure cert-sync script runs successfully in the active node on acmechief1001 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.done is 9 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[16:30:23] *sigh*
[16:30:55] RECOVERY - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is OK: FILE_AGE OK: /var/lib/acme-chief/certs/.rsync.status is 59 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
[16:35:20] Operations, Acme-chief, Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez)
[16:35:56] Operations, Acme-chief, Traffic: Memory leak on acme-chief 0.21 - https://phabricator.wikimedia.org/T234131 (Vgutierrez) p:Triage→High
[16:41:59] even more work... \o/
[17:52:48] (PS1) Daimona Eaytoy: Use AbuseFilterCachingParser for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095)
[17:59:05] (CR) Krinkle: [C: +1] Use AbuseFilterCachingParser for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/539674 (https://phabricator.wikimedia.org/T156095) (owner: Daimona Eaytoy)
[18:43:47] Operations, Traffic, netops: Unable to connect to Wikimedia sites from Iran? - https://phabricator.wikimedia.org/T234123 (Ladsgroup) Hadoop says there was a big reduction in page views in 10AM UTC and it went back to normal next hour: ` 0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> SELECT . . . . ....
[19:02:02] Operations, Traffic, netops: Unable to connect to Wikimedia sites from Iran? - https://phabricator.wikimedia.org/T234123 (MarcoAurelio) Open→Resolved Per T234123#5531842.
[19:06:00] (PS1) MarcoAurelio: gerrit: Fix renamed group name "Project and Group Creators" [puppet] - https://gerrit.wikimedia.org/r/539676
[19:07:02] (PS2) MarcoAurelio: gerrit: Fix renamed group name "Project and Group Creators" [puppet] - https://gerrit.wikimedia.org/r/539676
[19:07:04] (CR) jerkins-bot: [V: -1] gerrit: Fix renamed group name "Project and Group Creators" [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[19:10:08] (CR) MarcoAurelio: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[19:12:20] (CR) MarcoAurelio: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/275/" [puppet] - https://gerrit.wikimedia.org/r/539676 (owner: MarcoAurelio)
[19:17:03] (CR) Umherirrender: [C: -1] "We can start with "periodical", the messages are prepared in the WikimediaMessages extension (I29f61149f2c400b91acea639b6282dddd1b6552f)" [mediawiki-config] - https://gerrit.wikimedia.org/r/530871 (https://phabricator.wikimedia.org/T78711) (owner: Umherirrender)
[19:55:23] (CR) MarcoAurelio: "It looks 10107cf5 isn't correctly linked in the commit message although it is an existing commit for All-Projects:refs/meta/config: RECOVERY - Disk space on notebook1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=notebook1003&var-datasource=eqiad+prometheus/ops
[21:21:01] (PS1) Alex Monk: Remove some old Trusty stuff [puppet] - https://gerrit.wikimedia.org/r/539681
[21:23:23] (PS2) Alex Monk: Remove some old Trusty/Jessie stuff [puppet] - https://gerrit.wikimedia.org/r/539681
[21:26:19] (PS3) Alex Monk: Remove some old Trusty/Jessie stuff [puppet] - https://gerrit.wikimedia.org/r/539681
[21:55:54] (PS1) Alex Monk: Remove old Toolforge Clush master files [puppet] - https://gerrit.wikimedia.org/r/539685
[21:57:29] (CR) Alex Monk: "John Bond also noticed this in I33b8f615" [puppet] - https://gerrit.wikimedia.org/r/539685 (owner: Alex Monk)
[21:59:32] (PS2) Alex Monk: Remove old Toolforge Clush master files [puppet] - https://gerrit.wikimedia.org/r/539685
[23:04:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 46 probes of 461 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[23:10:35] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 28 probes of 461 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts