[00:33:11] (PS4) Dave Pifke: Add check_prometheus rules for navtiming [puppet] - https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739)
[00:35:02] (CR) Dave Pifke: Add check_prometheus rules for navtiming (2 comments) [puppet] - https://gerrit.wikimedia.org/r/597176 (https://phabricator.wikimedia.org/T225739) (owner: Dave Pifke)
[00:44:08] (PS2) Dave Pifke: php: $enable_request_profiling should affect CLI [puppet] - https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547)
[00:48:24] (CR) Dave Pifke: php: $enable_request_profiling should affect CLI (1 comment) [puppet] - https://gerrit.wikimedia.org/r/599476 (https://phabricator.wikimedia.org/T253547) (owner: Dave Pifke)
[01:19:03] (PS6) Reedy: Use noc@ not webmaster@ [puppet] - https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005)
[01:59:16] (PS2) Andrew Bogott: designate: allow mdns to listen on ipv6 [puppet] - https://gerrit.wikimedia.org/r/600017 (https://phabricator.wikimedia.org/T253780)
[01:59:18] (PS2) Andrew Bogott: pdns: add allow-axfr-ips setting for cloud auth recursors [puppet] - https://gerrit.wikimedia.org/r/600035
[01:59:20] (PS1) Andrew Bogott: Designate: have mdns use tcp rather than udp for axfr [puppet] - https://gerrit.wikimedia.org/r/600095 (https://phabricator.wikimedia.org/T253780)
[02:01:22] (CR) Andrew Bogott: [C: +2] designate: allow mdns to listen on ipv6 [puppet] - https://gerrit.wikimedia.org/r/600017 (https://phabricator.wikimedia.org/T253780) (owner: Andrew Bogott)
[02:01:28] (CR) Andrew Bogott: [C: +2] pdns: add allow-axfr-ips setting for cloud auth recursors [puppet] - https://gerrit.wikimedia.org/r/600035 (owner: Andrew Bogott)
[02:08:09] (CR) Andrew Bogott: [C: +2] Designate: have mdns use tcp rather than udp for axfr [puppet] - https://gerrit.wikimedia.org/r/600095 (https://phabricator.wikimedia.org/T253780) (owner: Andrew Bogott)
[02:50:25] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:52:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:26:14] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 64 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:37:54] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:49:54] Hi all, was a postmort doc created for the downtime in mid-April? If so where could I read it?
[05:50:50] was some time around the 12th from what I remember
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200530T0700)
[07:18:33] (CR) ArielGlenn: [C: +1] "Fine by me." [puppet] - https://gerrit.wikimedia.org/r/599783 (https://phabricator.wikimedia.org/T253173) (owner: Jbond)
[08:05:22] PROBLEM - puppet last run on an-presto1002 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:05:24] PROBLEM - puppet last run on an-presto1005 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:06:04] PROBLEM - puppet last run on an-presto1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:11:47] just re-enabled it, temporary? --^
[08:15:12] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:58] RECOVERY - puppet last run on an-presto1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:17:00] RECOVERY - puppet last run on an-presto1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:17:44] RECOVERY - puppet last run on an-presto1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:20:15] (forced the checks)
[09:57:47] (PS1) Elukey: Add support to pull datapoints from Kafka [software/druid_exporter] - https://gerrit.wikimedia.org/r/600295
[10:00:57] (PS2) Elukey: Add support to pull datapoints from Kafka [software/druid_exporter] - https://gerrit.wikimedia.org/r/600295
[12:05:37] (PS1) Ladsgroup: labs: Enable GrowthExperiments in Persian Wikipedia (beta only) [mediawiki-config] - https://gerrit.wikimedia.org/r/600319 (https://phabricator.wikimedia.org/T253291)
[12:16:08] (PS2) Ladsgroup: labs: Enable GrowthExperiments in Persian Wikipedia (beta only) [mediawiki-config] - https://gerrit.wikimedia.org/r/600319 (https://phabricator.wikimedia.org/T253291)
[12:18:43] (CR) Ladsgroup: [C: +2] "noop for production" [mediawiki-config] - https://gerrit.wikimedia.org/r/600319 (https://phabricator.wikimedia.org/T253291) (owner: Ladsgroup)
[12:19:28] (Merged) jenkins-bot: labs: Enable GrowthExperiments in Persian Wikipedia (beta only) [mediawiki-config] - https://gerrit.wikimedia.org/r/600319 (https://phabricator.wikimedia.org/T253291) (owner: Ladsgroup)
[12:54:25] (PS1) Ladsgroup: beta: Add Persian to suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/600335 (https://phabricator.wikimedia.org/T253291)
[18:40:42] (PS1) AntiCompositeNumber: engine.imagemagick: Catch pyexiv2.ExifValueError also [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/600400 (https://phabricator.wikimedia.org/T193326)
[18:41:09] (CR) jerkins-bot: [V: -1] engine.imagemagick: Catch pyexiv2.ExifValueError also [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/600400 (https://phabricator.wikimedia.org/T193326) (owner: AntiCompositeNumber)
[20:16:50] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[20:30:50] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:36:36] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:00:26] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:00:42] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:12] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:01:18] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:02:18] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:03:02] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:03:08] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:07:48] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:08:32] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:08:38] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:19:00] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1030 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:26:22] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:49:50] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1030 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm