[00:01:42] (03PS1) 10DannyS712: Clean up abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[00:03:52] (03PS2) 10DannyS712: Clean up abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[00:04:39] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Amanda Bittaker - https://phabricator.wikimedia.org/T238705 (10Dzahn) There are 2 LDAP users using Amanda's WMF email address: abittaker (uid 22529) and wubwubwub (uid 11703).
[00:06:40] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Amanda Bittaker - https://phabricator.wikimedia.org/T238705 (10Dzahn) 05Open→03Resolved a:05elukey→03Dzahn The wubwubwub user is already a member of the WMF group and matches the existing shell user. Looks like no change is needed....
[00:06:55] (03PS3) 10DannyS712: Clean up abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[00:11:02] (03PS4) 10DannyS712: Clean up abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[00:15:23] (03PS5) 10DannyS712: Clean up abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[00:19:42] (03PS1) 10Dzahn: analytics/admins: create admin group and for for airflor, apply on an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905)
[00:22:08] (03CR) 10jerkins-bot: [V: 04-1] analytics/admins: create admin group and for for airflor, apply on an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[00:25:31] 10Operations, 10Discovery-Search, 10SRE-Access-Requests, 10Patch-For-Review: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (10Dzahn) > probably a little bit more clean from the user perms point of view. Yes please, let's create a new admin gr...
[00:25:42] (03PS6) 10DannyS712: Clean up abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[00:28:25] (03PS3) 10Dzahn: admin: Remove myself (MaxSem) [puppet] - 10https://gerrit.wikimedia.org/r/552389 (https://phabricator.wikimedia.org/T238960) (owner: 10MaxSem)
[00:29:06] (03CR) 10Dzahn: [C: 03+2] admin: Remove myself (MaxSem) [puppet] - 10https://gerrit.wikimedia.org/r/552389 (https://phabricator.wikimedia.org/T238960) (owner: 10MaxSem)
[00:29:48] (03PS2) 10Dzahn: analytics/admins: create admin group and role for airflow [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905)
[00:32:07] (03CR) 10jerkins-bot: [V: 04-1] analytics/admins: create admin group and role for airflow [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[00:32:29] (03PS7) 10DannyS712: Clean up abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552610 (https://phabricator.wikimedia.org/T238965)
[00:36:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[00:36:45] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) Removed Max from the "wmf" LDAP group and WMF-NDA in Phab. Will re-add as volunteer in "nda" group.
[00:39:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[00:41:38] (03PS3) 10Dzahn: analytics/admins: create admin group and for for airflor, apply on an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905)
[00:44:31] (03CR) 10jerkins-bot: [V: 04-1] analytics/admins: create admin group and for for airflor, apply on an-airflow1001 [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[00:45:41] (03PS4) 10Dzahn: analytics/admins: create admin group and role for airflow [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905)
[00:50:43] (03PS1) 10DannyS712: Allow enwikiversity interface admins to remove their own interface administratorship [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552615 (https://phabricator.wikimedia.org/T238967)
[00:52:04] (03PS2) 10DannyS712: Allow enwikiversity interface admins to remove their own interface administratorship [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552615 (https://phabricator.wikimedia.org/T238967)
[00:52:35] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/19572/an-airflow1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/552613 (https://phabricator.wikimedia.org/T238905) (owner: 10Dzahn)
[01:17:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:22:25] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10aezell) As Max's former manager, I endorse his access to these resources under and NDA.
[02:52:21] (03PS10) 10Andrew Bogott: wmf_sink: Prepare to delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708)
[02:55:36] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: Prepare to delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott)
[03:52:29] (03PS2) 10Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708)
[03:52:31] (03PS1) 10Andrew Bogott: wmf-sink: include python-git package [puppet] - 10https://gerrit.wikimedia.org/r/552619 (https://phabricator.wikimedia.org/T238708)
[03:54:42] (03CR) 10Andrew Bogott: [C: 03+2] wmf-sink: include python-git package [puppet] - 10https://gerrit.wikimedia.org/r/552619 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott)
[05:06:19] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[05:32:35] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[05:34:13] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[05:45:41] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[05:49:05] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.008333 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[06:08:19] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:09:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:13] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 44.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:52:21] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 72.06 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[06:59:09] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:37] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.06 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:04:47] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (210933s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:06:01] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 89.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:20:12] (03PS4) 10ArielGlenn: move misc crons to dumpsdata1002 nfs server [puppet] - 10https://gerrit.wikimedia.org/r/551804 (https://phabricator.wikimedia.org/T224563)
[07:24:52] (03CR) 10ArielGlenn: [C: 03+2] move misc crons to dumpsdata1002 nfs server [puppet] - 10https://gerrit.wikimedia.org/r/551804 (https://phabricator.wikimedia.org/T224563) (owner: 10ArielGlenn)
[07:34:33] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (213372s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:34:47] you can ignore any whines from snapshot1008
[07:34:57] I'm shuffling things around over there (puppet whines that is)
[07:43:37] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 48.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:49:20] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.63 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:08:52] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.43 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:16:14] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) snapshot1008 now uses dumpsdata1002 as its nfs server. I had to manually systemctl stop nfs-mountd.service and start it again for dumpsdata1002 to pick up...
[08:19:59] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) And some of them are already on labstore1006, so rsyncs are working as expected.
[08:23:23] 🎶everyone's deploying on the weekend 🎶 everybody loves the puppet dance...
[08:23:45] but that's when there's a free spot in between wikidata entity dump runs, so whaddya gonna do
[08:23:52] * apergos checks back out again
[08:28:24] PROBLEM - SSH on db2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:31:38] RECOVERY - SSH on db2125 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:38:40] PROBLEM - SSH on db2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:41:58] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[08:43:38] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[08:55:58] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 57.59 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:58:08] PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:01:06] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 75.91 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:01:30] RECOVERY - PHP7 rendering on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 75273 bytes in 5.328 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:02:16] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 47 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:02:17] PROBLEM - LVS HTTPS IPv4 #page on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:02:54] uh
[09:03:48] I'm here too
[09:03:52] <_joe_> yeah not sure what's going on
[09:03:57] I don't have access to my laptop right now
[09:03:57] RECOVERY - LVS HTTPS IPv4 #page on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15461 bytes in 4.933 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:04:09] net issues?
[09:04:18] huh
[09:04:24] worlds fastest recovery
[09:04:26] <_joe_> I see ripe atlas alerts indeed
[09:05:08] yep
[09:05:11] <_joe_> if it happens again, I'll consider depooling
[09:05:12] indeed, I see no big drop in traffic in ulsfo
[09:05:22] but it's weird that we don't see upload alerting there
[09:05:45] a connectivity issue should affect both clusters
[09:05:50] * volans not at my laptop but can be in a few if needed
[09:06:10] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 60 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:07:08] I'm checking librenms
[09:08:43] can't find anything obvious in the event log
[09:11:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 77 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:11:34] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with BadStatusLine: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:13:27] mhh I'm wondering if the ripe atlas failures are failing on the atlas side too, e.g. badstatusline
[09:17:05] so yeah ripe atlas seems unhappy, but I'm failing to find impact on traffic on the dashboards if sth is going on
[09:17:32] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 67 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:20:14] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 62 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:22:02] the icinga checks for the lvs has been failing regularly for a few days at least, but recovering before it pages
[09:22:26] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 85 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:22:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 31 probes of 492 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:23:31] XioNoX: ack, thanks
[09:26:15] I'm around for the next 5h, layover in Munich
[09:26:50] for the curious: https://logstash.wikimedia.org/goto/0f345e27cb6d2781350b1ed091e3c10e
[09:27:12] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with BadStatusLine: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:28:03] godog, that's what I'm talking about https://usercontent.irccloud-cdn.com/file/hvUPNguD/Screenshot_20191123-102718.png
[09:28:33] (on my phone, trying to find a coffee)
[09:29:14] XioNoX: yeah I'm seeing the same from logstash
[09:33:05] for icinga lvs paging checks only: https://logstash.wikimedia.org/goto/05bfea23d6294a4fe8a3e576f3f52620
[09:34:48] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 65 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:35:25] unsurprisingly correlates with appservers latency rising
[09:36:06] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with URLError: urlopen error [Errno 0] Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:36:36] this that is: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1574409546138&to=1574501773704&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&panelId=10&fullscreen
[09:39:54] ok I don't think we're in immediate danger but appservers get latency is a bit worrying, might also trigger pages again if that's the root cause
[09:40:14] thoughts ?
[09:40:45] still on my phone, but I agree
[09:40:49] worth a task
[09:41:00] and if it pages again then investigate more?
[09:41:14] sounds good, opening the task now
[09:41:50] the ripe atlas ones look like they are doing maintenance on the infra? or having some issues?
[09:44:14] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with BadStatusLine: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:44:15] yeah looks like that to me too
[09:45:54] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with URLError: urlopen error [Errno 0] Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:47:13] alright, on my laptop, looks like Munich airport doesn't know how to operate a DHCP server
[09:47:18] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with URLError: urlopen error [Errno 0] Error https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:47:41] godog: I'm downtiming the ripe alerts until monday
[09:48:12] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 57 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:48:24] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[09:48:48] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:49:11] 10Operations, 10serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10fgiunchedi)
[09:49:17] XioNoX: sounds good, thank you
[09:49:24] ^ that the task
[09:49:52] !log downtime all ripe-atlas checks until Monday (most likely an upstream issue/maintenance)
[09:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:16] godog: which task?
[09:51:07] XioNoX: T238973
[09:51:07] T238973: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973
[09:51:50] godog: those big swing up and down look related? https://grafana.wikimedia.org/d/000000180/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:52:03] only codfw
[09:53:26] XioNoX: possible yeah, goes back three days at least
[09:53:30] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[09:54:18] 10Operations, 10serviceops: Appservers rising GET latency might have triggered LVS pages - https://phabricator.wikimedia.org/T238973 (10fgiunchedi)
[09:54:24] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10fgiunchedi) Found this task only now, but see also {T238973}
[09:55:32] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 2 probes of 546 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:55:38] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 76 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:56:08] ok I'm going afk
[09:58:00] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 2 probes of 546 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[09:58:36] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:01:10] godog: ok! I'll be around for ~4h
[10:08:50] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:27:38] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:27:44] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 29 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:30:30] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 26 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:31:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 26 probes of 492 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[10:34:28] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:41:14] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:56:38] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[10:59:13] (03PS1) 10Aklapper: Phabricator: Rename Priority field value "Normal" to "Medium" [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757)
[11:00:04] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:03:13] (03PS1) 10MarcoAurelio: Add .gitreview [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/552627
[11:03:32] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] Add .gitreview [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/552627 (owner: 10MarcoAurelio)
[11:04:45] I thought my net was lagging but someone else is saying they're lagging (both in KR) so just noting it for the record in case it's not just me (and them):-p
[11:05:12] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:05:33] (03CR) 10Aklapper: [C: 04-1] "Temporarily setting -1 because I'd like to give a last heads-up / call for feedback on wikitech-l@ about this." [puppet] - 10https://gerrit.wikimedia.org/r/552626 (https://phabricator.wikimedia.org/T228757) (owner: 10Aklapper)
[11:11:58] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:22:14] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:30:48] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:33:28] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6958 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:35:00] PROBLEM - PHP7 rendering on mw1332 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:36:40] RECOVERY - PHP7 rendering on mw1332 is OK: HTTP OK: HTTP/1.1 200 OK - 75173 bytes in 8.421 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[11:37:07] Is there something going on on enwiki? Really slow, especially with edit saves.
[11:40:16] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[11:40:58] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:46:08] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:47:51] revi: I've been lagging too
[11:47:52] <_joe_> !log restarting php7.2-fpm on mw1329
[11:47:53] :p
[11:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:37] I noticed it last night, revi.
[11:48:44] 8 hours ago
[11:51:16] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:56:55] <_joe_> !log oblivian@cumin1001:~$ sudo cumin -b2 -s60 A:mw-eqiad 'restart-php7.2-fpm'
[11:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:34] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:01:50] PROBLEM - Nginx local proxy to apache on mw1332 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1312 bytes in 4.666 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:03:28] RECOVERY - Nginx local proxy to apache on mw1332 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:03:38] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10Mathis_Benguigui) Hi, shouldn't this task be in //unbreak now!// priority?
[12:05:38] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10Joe) >>! In T238939#5686415, @Mathis_Benguigui wrote: > Hi, shouldn't this task be in //Unbreak now!// priority? Probably, given I'm investigating on Saturday. But I think ther...
[12:08:32] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10Mathis_Benguigui) p:05Triage→03Unbreak!
[12:11:59] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10Joe) Just to clarify - the situation got worrisome only this morning, when latencies skyrocketed and the issue became user-visible. I'm not sure the two issue are the same, but...
[12:15:12] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:16:51] <_joe_> Bsadowski1, revi are things any better?
[12:17:12] lemme see...
[12:17:33] <_joe_> the average response times of the backend did cut to one-third, so I have some hope
[12:17:51] much better it seems
[12:18:20] (and +1 from my friend
[12:18:38] <_joe_> the dear old "restart the service, see if that solves it"
[12:18:42] <_joe_> I hate doing it
[12:18:57] <_joe_> but it's saturday and I have a life to go back to :)
[12:19:04] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:23:28] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[12:24:29] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10Joe) 05Open→03Resolved a:03Joe Restarting php-fpm on the affected servers did solve the issue. I decided against doing deeper debugging before restarting the fleet because...
[12:29:18] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 89.51 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[13:59:59] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Joseagush) Well, now all templates (Mal) said "Galat script: no module." What just happen? Nevertheless, It wor...
[14:12:06] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Ladsgroup) 05Open→03Resolved Given T234384#5540226 by @Joe: > FTR, we did remove hhvm from production meani...
[14:23:41] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: touch (duration: 00m 55s)
[14:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:25] (03PS1) 10Alex Monk: profile::url_downloader: Add missing labs neutron subnet, also link-local [puppet] - 10https://gerrit.wikimedia.org/r/552631
[14:46:58] (03PS2) 10Alex Monk: profile::url_downloader: Add missing labs neutron subnet, also link-local [puppet] - 10https://gerrit.wikimedia.org/r/552631
[14:55:22] (03CR) 10BBlack: [C: 04-1] Switch from X-Real-IP to X-Client-IP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris)
[15:18:18] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.57 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:26:50] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:23:16] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.02 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:28:22] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.88 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:11:31] Hi, can anyone check what's happening with postmerge of operations/mediawiki-config https://integration.wikimedia.org/zuul/
[17:13:36] James_F, Reedy: ?
[17:17:12] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[17:25:58] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[17:29:22] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[17:34:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[18:19:57] !log repool wdqs1007, catched up on lag - T238229
[18:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:03] T238229: WDQS is having high update lag for the last week - https://phabricator.wikimedia.org/T238229
[18:31:06] PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%
[18:53:36] (03CR) 10Volans: [C: 04-1] "It seems that there is some problem with pynetbox and how it queries the API, see inline." (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov)
[19:21:51] (03PS1) 10Zoranzoki21: Add throttle rule for WMCL Editathon 2019-12-07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552640 (https://phabricator.wikimedia.org/T238986)
[19:22:29] (03CR) 10jerkins-bot: [V: 04-1] Add throttle rule for WMCL Editathon 2019-12-07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552640 (https://phabricator.wikimedia.org/T238986) (owner: 10Zoranzoki21)
[19:24:39] (03PS2) 10Zoranzoki21: Add throttle rule for WMCL Editathon 2019-12-07 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552640 (https://phabricator.wikimedia.org/T238986)
[20:16:16] (03PS1) 10Zoranzoki21: Equalization of wgPopupsReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552643
[20:20:44] (03PS2) 10Zoranzoki21: Equalization of wgPopupsReferencePreviews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552643
[20:57:10] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown, 10User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (10Urbanecm) >>! In T238285#5686501, @Joseagush wrote: > Well, now all templates (Mal) said "Galat script: no modu...
[23:15:24] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.51 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:18:50] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.03 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:40:04] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5542 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[23:43:32] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.004167 https://wikitech.wikimedia.org/wiki/Logstash https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash