[00:04:39] (PS1) Dzahn: switch labs-project-phabricator to labs-project-devtools [puppet] - https://gerrit.wikimedia.org/r/565703 [00:11:40] Operations, observability, Patch-For-Review: StatsD Exporter drops relayed metrics - https://phabricator.wikimedia.org/T239833 (colewhite) [00:13:43] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) [00:14:41] Operations, observability, Patch-For-Review: StatsD Exporter drops relayed metrics - https://phabricator.wikimedia.org/T239833 (colewhite) The latest patch appears to help a lot. There is still a discrepancy though that I haven't been able to track down. ` $ touch forwarded_new.txt && socat -t 0 FIL... [00:18:58] hi -ops [00:19:09] Would anyone care about a message from gfiber-noc@google.com? [00:19:37] will send to ops@wikimedia.org [00:19:46] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (Dzahn) [00:20:30] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) @jijiki the racking proposal will work for all the other servers but not those in racks C3 and A8. If we want to rack servers in C3 we will have to decom some servers since it... [00:21:14] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) [00:21:43] (CR) Dzahn: [C: +2] admin: add Rudolph Ampofo to ldap_only_admins (wmf) [puppet] - https://gerrit.wikimedia.org/r/565694 (https://phabricator.wikimedia.org/T243103) (owner: Dzahn) [00:21:58] Operations, ops-codfw: codfw: rack/setup/install parse200[1-20].codfw.wmnet - https://phabricator.wikimedia.org/T243112 (Papaul) p:Triage→Normal [00:23:45] Operations, LDAP-Access-Requests, SRE-Access-Requests, Patch-For-Review: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (Dzahn) Open→Resolved a:Dzahn @rudolph-san @Iflorez Access has been granted. Logging into superset should... [00:25:59] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) a:Joe→Papaul [00:26:18] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) [00:26:55] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) [00:27:44] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) [00:29:12] Operations, ops-codfw, serviceops: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Papaul) @jijiki in order to rack 32 servers in racks C3 and C4 we will have to first decom servers in those racks. 
[00:30:38] (PS1) Dzahn: install_server: update MAC address of gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/565708 (https://phabricator.wikimedia.org/T239151) [00:41:54] (CR) Dzahn: [C: +2] install_server: update MAC address of gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/565708 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn) [00:42:15] (PS2) Dzahn: install_server: update MAC address of gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/565708 (https://phabricator.wikimedia.org/T239151) [00:45:51] (PS1) Paladox: Phabricator: Make manage_user configurable [puppet] - https://gerrit.wikimedia.org/r/565712 [00:49:00] (PS2) Paladox: Phabricator: Make manage_user configurable [puppet] - https://gerrit.wikimedia.org/r/565712 [00:51:54] (PS3) Paladox: Phabricator: Make manage_user configurable [puppet] - https://gerrit.wikimedia.org/r/565712 [00:54:43] (PS4) Dzahn: Phabricator: Make scap's manage_user configurable [puppet] - https://gerrit.wikimedia.org/r/565712 (owner: Paladox) [00:55:37] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/20437/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/565712 (owner: Paladox) [00:56:28] (CR) Dzahn: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1002/20439/phabricator-prod-1001.devtools.eqiad.wmflabs/" [puppet] - https://gerrit.wikimedia.org/r/565712 (owner: Paladox) [00:57:43] (PS5) Paladox: Phabricator: Make scap's manage_user configurable [puppet] - https://gerrit.wikimedia.org/r/565712 [00:59:03] (PS6) Paladox: Phabricator: Make scap's manage_user configurable [puppet] - https://gerrit.wikimedia.org/r/565712 [01:00:53] (PS7) Paladox: Phabricator: Make scap's manage_user configurable [puppet] - https://gerrit.wikimedia.org/r/565712 [01:04:06] (CR) Dzahn: [C: -1] "unfortunately https://puppet-compiler.wmflabs.org/compiler1002/20443/phabricator-prod-1001.devtools.eqiad.wmflabs/change.phabricator-prod-" [puppet] - https://gerrit.wikimedia.org/r/565712 (owner: Paladox) [01:06:52] (PS1) Paladox: Scap: Fix target to be able to set manage_user to false [puppet] - https://gerrit.wikimedia.org/r/565713 [01:09:18] (PS2) Paladox: Scap: Fix target to be able to set manage_user to false [puppet] - https://gerrit.wikimedia.org/r/565713 [01:09:52] (PS3) Paladox: Scap: Fix target to be able to set manage_user to false [puppet] - https://gerrit.wikimedia.org/r/565713 [01:12:43] (PS1) Dzahn: gerrit: set gerrit host name and server list for gerrit1002/gerrit-test [puppet] - https://gerrit.wikimedia.org/r/565715 (https://phabricator.wikimedia.org/T239151) [01:14:06] (CR) Paladox: [C: +1] gerrit: set gerrit host name and server list for gerrit1002/gerrit-test [puppet] - https://gerrit.wikimedia.org/r/565715 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn) [01:14:52] (CR) Dzahn: [C: +2] gerrit: set gerrit host name and server list for gerrit1002/gerrit-test [puppet] - https://gerrit.wikimedia.org/r/565715 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn) [01:15:03] (PS2) Dzahn: gerrit: set gerrit host name and server list for gerrit1002/gerrit-test [puppet] - https://gerrit.wikimedia.org/r/565715 (https://phabricator.wikimedia.org/T239151) [01:17:33] (CR) Paladox: ferm_misc/db: allow connections from gerrit-test in ferm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn) 
[01:22:07] (PS1) Dzahn: acme_chief/gerrit: remove gerrit-new, add gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/565716 (https://phabricator.wikimedia.org/T239151) [01:24:42] (CR) Dzahn: ferm_misc/db: allow connections from gerrit-test in ferm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn) [01:25:00] (PS2) Dzahn: ferm_misc/db: allow connections from gerrit1002 in ferm [puppet] - https://gerrit.wikimedia.org/r/562965 (https://phabricator.wikimedia.org/T239151) [01:25:24] (CR) Paladox: [C: +1] acme_chief/gerrit: remove gerrit-new, add gerrit1002 [puppet] - https://gerrit.wikimedia.org/r/565716 (https://phabricator.wikimedia.org/T239151) (owner: Dzahn) [01:33:56] Operations, LDAP-Access-Requests, SRE-Access-Requests: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (Iflorez) Hoorah! Thank you @Dzahn! [01:34:41] Operations, LDAP-Access-Requests, SRE-Access-Requests: Request for LDAP access to the WMF group for Rudolph Ampofo - https://phabricator.wikimedia.org/T243103 (Iflorez) >>! In T243103#5814229, @Aklapper wrote: >>>! In T243103#5813834, @Iflorez wrote: >> We were told that for Superset access he needed... [01:38:44] (CR) Alex Monk: "I wasn't aware we installed keystone anywhere inside any Cloud VPS machines, but okay, we can add some defaults I guess" [puppet] - https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441) (owner: Alex Monk) [01:41:48] (PS3) Alex Monk: CloudVPS: codfw1dev: Fix default SSH rule to use correct range [puppet] - https://gerrit.wikimedia.org/r/565431 (https://phabricator.wikimedia.org/T229441) [02:06:20] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:18:00] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:41:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [02:46:48] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:07:58] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:13:48] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:16:58] PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100% [03:18:32] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 36 probes of 515 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:24:24] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 34 probes of 515 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:32:56] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:38:44] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [04:15:54] !log cp3065.mgmt: /admin1-> racadm serveraction hardreset T238305 [04:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:59] T238305: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 [04:18:18] RECOVERY - Host cp3065 is UP: PING OK - Packet loss = 0%, RTA = 83.36 ms [04:21:59] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (CDanis) `03:16:58 <+icinga-wm> PROBLEM - Host cp3065 is DOWN: PING CRITICAL - Packet loss = 100%` Nothing in `racadm getsel` or `racadm lclog view` (latter just has me logging in over SSH). [05:17:16] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [05:23:06] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [06:29:14] PROBLEM - MD RAID on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:29:22] PROBLEM - DPKG on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:29:42] PROBLEM - Disk space on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [06:29:50] PROBLEM - dhclient process on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:29:52] PROBLEM - puppet last run on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:29:52] PROBLEM - Check whether ferm is active by checking the default input chain on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:30:10] PROBLEM - configured eth on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:30:32] PROBLEM - Check size of conntrack table on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: 
Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:30:34] PROBLEM - ores uWSGI web app on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:30:58] PROBLEM - Check systemd state on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:44] PROBLEM - ores_workers_running on ores1005 is CRITICAL: connect to address 10.64.32.14 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/ORES [06:46:18] RECOVERY - dhclient process on ores1005 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:46:19] RECOVERY - Check whether ferm is active by checking the default input chain on ores1005 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:46:24] RECOVERY - ores_workers_running on ores1005 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [06:46:36] RECOVERY - configured eth on ores1005 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:46:58] RECOVERY - Check size of conntrack table on ores1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:47:22] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:47:26] RECOVERY - Check systemd state on ores1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:30] RECOVERY - MD RAID on ores1005 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:47:40] RECOVERY - DPKG on ores1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:47:58] RECOVERY - Disk space on ores1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1005&var-datasource=eqiad+prometheus/ops [07:05:44] !log Remove partitions from enwiki.revision on db2085 T239453 [07:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:47] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [09:00:29] !log repool wdqs1007 (T242453) [09:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:33] T242453: wdqs1005 stopped to handle updates - https://phabricator.wikimedia.org/T242453 [10:19:42] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:20:21] (PS1) Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) [10:21:31] (CR) jerkins-bot: [V: -1] Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) (owner: Zoranzoki21) [10:21:50] (PS2) 
Zoranzoki21: Add vzg-easydb.gbv.de to the wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/565723 (https://phabricator.wikimedia.org/T243118) [10:25:30] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:40:50] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:46:40] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:02:00] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:19:30] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:28:34] (PS1) Majavah: Add logos for ngwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) [11:28:36] (PS1) Majavah: Configure logo for ngwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/565725 (https://phabricator.wikimedia.org/T242416) [11:29:06] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:30:00] (CR) jerkins-bot: [V: -1] Configure logo for ngwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/565725 (https://phabricator.wikimedia.org/T242416) (owner: Majavah) [11:35:41] !log upgraded spicerack to 0.0.29 on cumin hosts [11:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:42] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [12:03:42] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [12:15:20] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [12:36:28] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 37 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [12:42:18] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [12:51:50] PROBLEM 
- IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 37 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [13:15:08] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [13:24:36] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [13:30:26] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 511 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:12:48] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:24:28] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [14:43:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:00:54] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 37 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [15:12:34] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [15:42:26] interesting, esams ipv6 connectivity did get worse in the past ~24h [15:48:43] Operations, Traffic, netops: esams ipv6 reachability degraded - https://phabricator.wikimedia.org/T243127 (CDanis) [15:48:48] Operations, Traffic, netops: esams ipv6 reachability degraded - https://phabricator.wikimedia.org/T243127 (CDanis) p:Triage→Normal [16:10:20] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:16:10] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:25:39] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:31:28] RECOVERY - IPv6 ping to esams on 
ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:33:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:38:54] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:42:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [16:49:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:50:30] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:55:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:02:08] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:17:30] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:23:20] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:32:50] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 510 (alerts on 35) - 
https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [17:38:42] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 35 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:05:38] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 36 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:34:46] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 34 probes of 510 (alerts on 35) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:36:12] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:41:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:47:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:50:48] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.267e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:58:08] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:20:54] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:21:06] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:52:02] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:52:14] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [20:49:38] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.061e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [20:56:59] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [22:22:06] PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100% [22:41:12] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.15e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [22:50:04] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [22:57:46] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw