[00:22:10] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [00:46:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 44 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:02:40] 10Operations, 10Product-Analytics: Upload shiny-server .deb to our Jessie apt repository - https://phabricator.wikimedia.org/T168967 (10Legoktm) It looks like the Debian Science team was working on packaging this for Debian (cf https://salsa.debian.org/science-team/shiny-server) but I don't see an actual ITP.... [01:21:52] 10Operations, 10Traffic, 10Performance-Team (Radar): Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954 (10Krinkle) [01:42:16] 10Operations, 10monitoring: add monitoring to alert on hosts without RAID - https://phabricator.wikimedia.org/T206131 (10Dzahn) p:05Triage>03Normal [01:43:34] 10Operations, 10Continuous-Integration-Config, 10Patch-For-Review: rspec-puppet fails with Could not find the daemon directory (tested [/etc/sv,/var/lib/service]) - https://phabricator.wikimedia.org/T203645 (10Dzahn) p:05Triage>03Normal [01:47:32] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:54:18] (03PS1) 10Dzahn: nagios_common: on jessie, also install libmonitoring-plugin-perl [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) [01:54:43] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 44 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [01:59:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:02:19] PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:05:32] (03PS1) 10Dzahn: nagios_common: switch check_ssl to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) [02:06:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:06:58] (03PS2) 10Dzahn: nagios_common: switch check_ssl to use Monitoring::Plugin [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) [02:09:04] (03Abandoned) 10Dzahn: icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:10:16] (03CR) 10Dzahn: "thanks for that proposal, that sounds good, i uploaded:" [puppet] - 10https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:11:05] (03CR) 10Dzahn: [C: 04-2] nagios_common: add stretch support to check_ssl [puppet] - 10https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:12:09] (03CR) 10Dzahn: "followed by https://gerrit.wikimedia.org/r/#/c/467013/" [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:13:50] (03Abandoned) 10Dzahn: tor: class to extract fingerprints of multiple relays (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/461197 (owner: 10Dzahn) [02:14:50] (03Abandoned) 10Dzahn: 
cache::misc: add apache-fast-test script [puppet] - 10https://gerrit.wikimedia.org/r/423557 (owner: 10Dzahn) [02:16:26] (03PS2) 10Dzahn: tor::relay: make Tor family configurable and move to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/459876 [02:18:19] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:18] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 277 bytes in 2.722 second response time [02:23:03] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) @Cwhite regarding the Juniper alerts that have "Can't locate Nagios/Plugin/Getopt.pm", Google sent me to our own phab where Fundraising tech fixed it... [02:24:39] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:18] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) new approach to make check_ssl work: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/467011/ followed by https://gerrit.wikimedia.org/r/#/c/... [02:27:49] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.005 second response time [02:37:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:37:58] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) confirmed on icinga1001 that the fix for check_ssl is actually: ``` vi check_ssl :0,$s/Nagios::/Monitoring::/g :0,$s/nagios_exit/plugin_exit/g :wq `... 
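Note on the check_ssl fix quoted just above: the two vi substitutions rename the `Nagios::` namespace to `Monitoring::` and `nagios_exit` to `plugin_exit` across the whole file. A minimal sketch of the same rewrite in Python follows. The function name `port_plugin_source` and the sample Perl text are hypothetical and only for illustration; the quoted message is truncated, so any further steps in the real fix are not covered here.

```python
# Renames taken from the vi substitutions quoted in the log message;
# anything beyond these two is not known from the log.
SUBSTITUTIONS = [
    ("Nagios::", "Monitoring::"),
    ("nagios_exit", "plugin_exit"),
]


def port_plugin_source(source: str) -> str:
    """Apply both renames to every occurrence in the given script text."""
    for old, new in SUBSTITUTIONS:
        source = source.replace(old, new)
    return source


# Hypothetical Perl snippet for illustration; this is not the real check_ssl.
sample = "use Nagios::Plugin;\n$p->nagios_exit(OK, 'fine');\n"
print(port_plugin_source(sample))
```

Plain string replacement is enough here because neither pattern in the vi commands uses regex metacharacters, and `str.replace` already rewrites every occurrence, matching the `g` flag.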
[02:39:36] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T202782#4663046" [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:44:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 50 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [02:49:16] (03PS1) 10Dzahn: icinga: replace Nagios::Plugin with Monitoring::Plugin in multiple Perl scripts [puppet] - 10https://gerrit.wikimedia.org/r/467015 (https://phabricator.wikimedia.org/T202782) [02:58:38] (03PS1) 10Dzahn: icinga/etcd: /var/run/icinga/ -> /var/run/nagios/ [puppet] - 10https://gerrit.wikimedia.org/r/467017 (https://phabricator.wikimedia.org/T202782) [03:00:58] (03CR) 10Dzahn: "see the UNKNOWN column for mw servers https://icinga-stretch.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustyp" [puppet] - 10https://gerrit.wikimedia.org/r/467017 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [03:06:31] (03PS1) 10BBlack: Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/467018 [03:07:03] (03CR) 10BBlack: [C: 032] Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/467018 (owner: 10BBlack) [03:07:57] !log eqsin repooled [03:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:04] ACKNOWLEDGEMENT - Host cloudvirt1019 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T196507 [03:10:09] ACKNOWLEDGEMENT - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://gerrit.wikimedia.org/r/#/c/operations/dns/+/467018/ [03:11:59] 10Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T164955 (10Dzahn) 05duplicate>03Open https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=heze&service=Device+not+healthy+-SMART- Current 
Status: CRITICAL (for 0d 8h 6m 11s) Status Information: cluster... [03:12:39] ACKNOWLEDGEMENT - Device not healthy -SMART- on heze is CRITICAL: cluster=misc device=megaraid,8 instance=heze:9100 job=node site=codfw daniel_zahn https://phabricator.wikimedia.org/T164955 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=heze&var-datasource=codfw%2520prometheus%252Fops [03:13:49] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Dzahn) [03:13:52] 10Operations, 10ops-eqiad, 10DBA: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (10Dzahn) 05Resolved>03Open https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1069&service=Memory+correctable+errors+-EDAC- Service Memory correctable errors -EDAC- On Host... [03:14:10] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on db1069 is CRITICAL: 8.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T201133 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [03:18:28] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 57.26 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:31:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 874.19 seconds [03:49:39] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 76.76 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:50:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [03:57:39] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is 
CRITICAL: CRITICAL - failed 46 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [04:01:19] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 293.19 seconds [05:09:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:13:12] (03CR) 10Krinkle: Move auth logging to different channels for easier counting (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464077 (https://phabricator.wikimedia.org/T150300) (owner: 10Gergő Tisza) [05:13:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:18:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:20:18] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:23:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [05:26:40] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:30:49] RECOVERY - MediaWiki memcached error rate on 
graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [05:30:59] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 44 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:28:29] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean] [06:28:29] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/bash/puppet-common.sh] [06:28:48] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/smartmontools/run.d/20logger] [06:28:49] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints] [06:42:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:49:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [06:58:49] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:58:58] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:18] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:18] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:12:08] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.27 seconds [07:30:50] all the above mw memcached errors seem to be related to yields: https://grafana.wikimedia.org/dashboard/db/memcache?orgId=1&panelId=38&fullscreen&from=1539405530990&to=1539410221201 [07:31:12] for the mc1035 shard [07:48:08] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 179.85 seconds [08:02:18] RECOVERY - DPKG on ores1001 is OK: All packages OK [08:12:19] 10Operations, 10Wikimedia-General-or-Unknown, 10Performance: arwiki page giving "entire web request took longer than 60 seconds and timed out" - https://phabricator.wikimedia.org/T206878 (10alanajjar) >Since yesterday, many users are complaining that they can't open this page (portal page), and when try to o... 
[08:13:17] Hello, can anyone take a look at this https://phabricator.wikimedia.org/T206878 ? Thanks in advance [08:25:39] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [08:32:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 42 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [09:17:38] RECOVERY - Device not healthy -SMART- on heze is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=heze&var-datasource=codfw%2520prometheus%252Fops [09:54:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [10:01:19] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 57 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [10:43:09] PROBLEM - MegaRAID on heze is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [10:43:20] ACKNOWLEDGEMENT - MegaRAID on heze is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206909 [10:43:25] 10Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T206909 (10ops-monitoring-bot) [10:55:31] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 46.41 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:58:51] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: 
(C)60 le (W)70 le 95.05 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:07:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:14:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:32:56] (03CR) 10Faidon Liambotis: nagios_common: on jessie, also install libmonitoring-plugin-perl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/467011 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [11:33:40] (03CR) 10Faidon Liambotis: [C: 031] "LGTM, but do test it before deploying!" [puppet] - 10https://gerrit.wikimedia.org/r/467013 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [11:34:51] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [11:42:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 43 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [12:19:44] 10Operations, 10monitoring: add monitoring to alert on hosts without RAID - https://phabricator.wikimedia.org/T206131 (10Volans) I'm thinking that it would be an overkill to have this as an Icinga NRPE check. Some other alternatives might be: - check Icinga configuration to ensure that all hosts have at least... 
[12:49:01] 10Operations, 10monitoring: add monitoring to alert on hosts without RAID - https://phabricator.wikimedia.org/T206131 (10Dzahn) Yes, i was thinking the same. I did not expect to add this as an Icinga check. I expected to add a script that is run by cron and sends out email to tell us. [12:49:32] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [12:50:01] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [12:58:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:05:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 86 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:10:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:17:41] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:27:50] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 25 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:35:00] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 
38 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:06:10] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:13:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:13:26] (03PS3) 10Zoranzoki21: Add throttle rules and remove outdated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) [15:20:50] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:20:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:21:00] PROBLEM - Host restbase1015 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:01] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [15:21:30] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a 
response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:21:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:21:40] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [15:21:40] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [15:22:01] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy [15:22:31] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [15:22:40] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy [15:22:51] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [15:23:30] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [15:24:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [15:24:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [15:28:40] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:31:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: 
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [15:31:31] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [15:34:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [15:35:51] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [15:37:31] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [15:40:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [15:41:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [15:42:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [15:42:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [15:42:41] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [15:43:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [15:43:41] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [15:45:21] RECOVERY - Mobileapps LVS eqiad on 
mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [15:47:21] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [15:47:51] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:47:51] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [15:49:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [15:49:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [15:50:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [15:52:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [15:53:10] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [15:53:11] PROBLEM - Restbase root url on restbase1008 is CRITICAL: connect to address 10.64.32.178 and port 7231: Connection refused [15:55:10] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [15:55:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [15:56:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: 
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:58:20] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:59:20] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [16:00:41] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.348 second response time [16:01:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:03:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:08:20] PROBLEM - Restbase root url on restbase1008 is CRITICAL: HTTP CRITICAL - No data received from host [16:10:30] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.010 second response time [16:10:40] PROBLEM - MD RAID on aqs1006 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 [16:10:42] ACKNOWLEDGEMENT - MD RAID on aqs1006 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: 
https://phabricator.wikimedia.org/T206915 [16:10:46] 10Operations, 10ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (10ops-monitoring-bot) [16:10:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [16:11:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [16:11:51] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [16:13:31] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 205.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [16:23:21] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:24:21] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [16:27:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:27:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:28:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [16:30:10] RECOVERY - Mobileapps LVS eqiad on 
mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [16:32:31] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [16:32:35] (03PS4) 10Urbanecm: Add two throttle rules and remove outdated rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [16:32:46] (03PS5) 10Urbanecm: Add two throttle rules and remove outdated rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [16:32:57] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [16:35:03] (03PS6) 10Urbanecm: Add two throttle rules and remove outdated rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) (owner: 10Zoranzoki21) [16:36:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:36:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 27 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:37:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:38:21] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) timed out before a response was received [16:38:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [16:39:21] RECOVERY - mobileapps endpoints health on scb2006 is OK: All 
endpoints are healthy [16:42:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:43:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [16:43:50] (03PS1) 10Urbanecm: Add shn to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/467080 (https://phabricator.wikimedia.org/T206777) [16:43:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 64 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:44:03] (03CR) 10jerkins-bot: [V: 04-1] Add shn to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/467080 (https://phabricator.wikimedia.org/T206777) (owner: 10Urbanecm) [16:45:41] PROBLEM - Restbase root url on restbase1013 is CRITICAL: connect to address 10.64.32.80 and port 7231: Connection refused [16:46:10] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) timed out before a response was received [16:47:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [16:47:10] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [16:47:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:48:09] 10Operations, 10Thumbor: in Commons, some PDFs are failing to render thumbnails. 
- https://phabricator.wikimedia.org/T203402 (10Aklapper) [16:49:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [16:50:20] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [16:54:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:54:21] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.010 second response time [16:55:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:55:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [16:55:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [16:55:49] 10Operations, 10Certcentral, 10Traffic: Integrate certspotter with certcentral to avoid certspotter notifying us on legitimate certs generated by our certcentral boxes - https://phabricator.wikimedia.org/T204994 (10Krenair) [16:57:20] 10Operations, 10Certcentral, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Krenair) [16:57:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [16:58:07] 10Operations, 10Certcentral, 
10Traffic, 10vm-requests, 10Patch-For-Review: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Krenair) [16:58:11] 10Operations, 10Certcentral, 10Traffic, 10Continuous-Integration-Config, and 2 others: Set up CI for new repo operations/software/certcentral.git - https://phabricator.wikimedia.org/T198541 (10Krenair) [16:58:16] 10Operations, 10Certcentral, 10Traffic: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (10Krenair) [16:58:22] 10Operations, 10Certcentral, 10Traffic, 10Continuous-Integration-Config: Provide a CI container with pebble - https://phabricator.wikimedia.org/T200405 (10Krenair) [16:58:25] 10Operations, 10Certcentral, 10Traffic, 10Patch-For-Review: certcentral: phantom test failure around challenge success - https://phabricator.wikimedia.org/T203422 (10Krenair) [16:58:28] 10Operations, 10Certcentral, 10Traffic, 10Patch-For-Review: certcentral: Make configurable the cmd executed to perform a DNS zone update - https://phabricator.wikimedia.org/T203678 (10Krenair) [16:58:31] 10Operations, 10Certcentral, 10Traffic, 10Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (10Krenair) [16:58:50] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [16:59:18] 10Operations, 10Certcentral, 10Traffic: certcentral: challenge checking on *all* pooled backend hosts - https://phabricator.wikimedia.org/T203396 (10Krenair) [16:59:29] 10Operations, 10Certcentral, 10Traffic: certcentral: Provide script for certificate revocation - https://phabricator.wikimedia.org/T203423 (10Krenair) [16:59:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [16:59:46] 10Operations, 10Certcentral, 10Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (10Krenair) [17:00:00] PROBLEM - 
Restbase root url on restbase1013 is CRITICAL: HTTP CRITICAL - No data received from host [17:01:48] (03CR) 10Mathew.onipe: [C: 031] My tests show that Kafka poller behaves much better with -b 700 [puppet] - 10https://gerrit.wikimedia.org/r/467002 (owner: 10Smalyshev) [17:02:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:02:30] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [17:03:20] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [17:03:51] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [17:04:20] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.024 second response time [17:04:51] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [17:05:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) timed out before a response was received [17:05:30] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [17:05:41] RECOVERY - restbase 
endpoints health on restbase-dev1004 is OK: All endpoints are healthy [17:06:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 42 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:07:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [17:08:10] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [17:09:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:11:50] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [17:13:01] PROBLEM - Restbase root url on restbase1013 is CRITICAL: HTTP CRITICAL - No data received from host [17:14:11] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.027 second response time [17:14:51] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [17:15:11] (03PS57) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [17:15:51] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [17:15:56] 10Operations, 10Traffic, 10HTTPS: letsencrypt puppetization: add parallel rsa+ecdsa cert support - https://phabricator.wikimedia.org/T141266 (10Krenair) @bblack ? 
[17:16:31] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: letsencrypt puppetization: upgrade for scalability - https://phabricator.wikimedia.org/T134447 (10Krenair) @bblack ? [17:17:21] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [17:19:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [17:25:30] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January [17:25:30] e a response was received [17:26:41] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [17:26:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [17:27:31] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [17:31:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [17:31:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [17:33:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for 
Jan 15) timed out before a response was received [17:40:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [17:41:01] PROBLEM - Device not healthy -SMART- on aqs1006 is CRITICAL: cluster=aqs device=sde instance=aqs1006:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1006&var-datasource=eqiad%2520prometheus%252Fops [17:42:51] there is https://phabricator.wikimedia.org/T206915 for --^ [17:44:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [17:44:10] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [17:46:31] PROBLEM - Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused [17:49:40] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received [17:50:31] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [17:52:10] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [17:54:20] RECOVERY - Restbase root url on restbase1011 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.011 second response time [17:56:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:57:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:59:41] PROBLEM - 
Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused [18:04:00] RECOVERY - Restbase root url on restbase1011 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.015 second response time [18:04:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:06:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:06:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:06:40] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200) [18:06:41] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [um [18:07:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:07:41] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:08:50] PROBLEM - mobileapps 
endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [18:09:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:09:50] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [18:10:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:11:10] RECOVERY - Device not healthy -SMART- on aqs1006 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1006&var-datasource=eqiad%2520prometheus%252Fops [18:12:11] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:12:41] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:12:42] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:15:31] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:16:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:16:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:17:41] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:19:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:19:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:21:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:22:11] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:22:50] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a 
response was received [18:22:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:23:40] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=tru [18:23:40] xpected status 504 (expecting: 200) [18:25:50] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [18:25:51] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sde1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [18:26:41] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:28:50] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received [18:29:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:32:00] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [18:32:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed 
out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:33:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [18:33:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:34:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:37:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:41:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:41:30] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:41:30] (03PS7) 10Zoranzoki21: Add two throttle rules and remove outdated rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) [18:43:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [18:50:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:53:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [18:55:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve 
all events for Jan 15) timed out before a response was received [18:58:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:05:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:05:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:06:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:11:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:12:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:12:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:13:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [19:16:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:16:32] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:20:21] PROBLEM - Restbase LVS eqiad on 
restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:20:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [19:20:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [19:21:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:21:30] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [19:22:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:24:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:24:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:30:00] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 52 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [19:30:38] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, 10User-Urbanecm: Remove *.cz domains from WMF's infrastructure - https://phabricator.wikimedia.org/T206923 (10Urbanecm) [19:31:21] (03PS1) 10Urbanecm: 
Remove *.cz [dns] - 10https://gerrit.wikimedia.org/r/467087 (https://phabricator.wikimedia.org/T206923) [19:36:50] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:38:09] (03PS1) 10Urbanecm: Remove *.cz redirects [puppet] - 10https://gerrit.wikimedia.org/r/467088 [19:38:10] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [19:39:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [19:41:31] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [19:48:10] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:48:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [19:49:10] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [19:51:09] (03PS2) 10Urbanecm: Remove *.cz redirects [puppet] - 10https://gerrit.wikimedia.org/r/467088 (https://phabricator.wikimedia.org/T206923) [19:51:19] (03PS3) 10Urbanecm: Remove *.cz redirects [puppet] - 10https://gerrit.wikimedia.org/r/467088 (https://phabricator.wikimedia.org/T206923) [19:52:30] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [19:56:50] PROBLEM - restbase endpoints health on restbase-dev1006 is 
CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:56:50] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:57:51] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [19:57:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [19:59:41] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [19:59:50] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (10Krinkle) [20:00:41] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [20:07:01] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (10Smalyshev) 05Resolved>03Open p:05Normal>03High This started happening again - the events jumped starting Oct 11 and ar... 
[20:15:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:26:36] (CR) Framawiki: [C: 1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/467087 (https://phabricator.wikimedia.org/T206923) (owner: Urbanecm)
[20:27:38] (CR) Framawiki: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/467088 (https://phabricator.wikimedia.org/T206923) (owner: Urbanecm)
[20:27:50] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:41:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:42:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[20:44:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:45:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:46:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[20:46:30] PROBLEM - Restbase root url on
restbase1013 is CRITICAL: connect to address 10.64.32.80 and port 7231: Connection refused
[20:48:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[20:49:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[20:49:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[20:52:21] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:52:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:53:01] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:53:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[20:54:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[20:54:11] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.012 second response time
[20:56:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL:
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:57:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[20:59:31] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[20:59:40] PROBLEM - Restbase root url on restbase1013 is CRITICAL: HTTP CRITICAL - No data received from host
[21:02:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[21:02:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[21:03:00] RECOVERY - Restbase root url on restbase1013 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 4.083 second response time
[21:29:23] (CR) Gergő Tisza: Move auth logging to different channels for easier counting (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/464077 (https://phabricator.wikimedia.org/T150300) (owner: Gergő Tisza)
[21:29:41] (PS2) Gergő Tisza: Move auth logging to different channels for easier counting [mediawiki-config] - https://gerrit.wikimedia.org/r/464077 (https://phabricator.wikimedia.org/T150300)
[21:33:01] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[21:35:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[21:36:50] PROBLEM - restbase endpoints health on
restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[21:40:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[21:42:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[21:43:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[21:44:01] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[21:44:10] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received
[21:46:30] PROBLEM - Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused
[21:48:20] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[21:48:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[21:48:50] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[21:49:30] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[21:51:00] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints
are healthy
[21:51:00] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[21:54:10] RECOVERY - Restbase root url on restbase1011 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.010 second response time
[21:54:20] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[21:58:10] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[21:58:30] PROBLEM - Restbase root url on restbase1011 is CRITICAL: connect to address 10.64.0.113 and port 7231: Connection refused
[21:58:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[21:59:11] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[22:02:40] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:03:50] RECOVERY - Restbase root url on restbase1011 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.012 second response time
[22:05:51] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[22:06:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[22:06:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[22:09:11] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received:
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:09:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:10:11] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[22:10:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[22:18:30] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[22:19:30] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[22:20:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:22:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[22:24:11] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:25:02] (CR) Krinkle: [C: 1] Move auth logging to different channels for easier counting (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/464077 (https://phabricator.wikimedia.org/T150300) (owner: Gergő Tisza)
[22:25:28] (CR) Krinkle: [C: 1] "(concern
isn't blocking, but given it's wmf-config, doing +1 instead of +2)" [mediawiki-config] - https://gerrit.wikimedia.org/r/464077 (https://phabricator.wikimedia.org/T150300) (owner: Gergő Tisza)
[22:26:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[22:26:01] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:31:30] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:33:31] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[22:33:31] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[22:34:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:38:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:38:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL:
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[22:39:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[22:44:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[22:44:21] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[22:45:01] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[22:46:10] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[22:47:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[22:47:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[22:50:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[22:50:51] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[22:55:00] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[22:56:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[22:57:40] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a
response was received
[22:58:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[23:05:30] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[23:07:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[23:07:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[23:07:40] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:10:50] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[23:13:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:13:40] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[23:14:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[23:20:31] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL:
/{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) timed out before a response was received
[23:20:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[23:22:20] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[23:22:40] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[23:22:51] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[23:23:24] (PS5) Krinkle: Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[23:23:45] (CR) jerkins-bot: [V: -1] Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[23:24:21] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[23:24:46] (CR) Krinkle: [C: 2] "Confirmed for five random 30min ranges ranges in Logstash that all hhvm messages that weren't about SlowTimer, ended up in mediawiki/fatal" [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[23:24:57] (CR) jerkins-bot: [V: -1] Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[23:26:03] (PS6) Krinkle: Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner:
Gergő Tisza)
[23:26:09] (CR) Krinkle: [C: 2] Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[23:26:12] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:26:20] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[23:26:20] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:27:28] (Merged) jenkins-bot: Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[23:30:31] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[23:30:31] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[23:31:07] (PS1) Smalyshev: Enable tracking lexemes in Updater [puppet] - https://gerrit.wikimedia.org/r/467097
[23:31:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[23:33:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All
endpoints are healthy
[23:36:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:36:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:37:21] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T45086 - I4857e8ac (duration: 00m 51s)
[23:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:25] T45086: Capture PHP warnings with stacktraces in MediaWiki and save to logstash - https://phabricator.wikimedia.org/T45086
[23:39:10] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) timed out before a response was received
[23:39:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[23:39:20] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[23:40:10] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[23:40:58] (PS8) Zoranzoki21: Add two throttle rules and remove outdated rule [mediawiki-config] - https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408)
[23:41:05] (PS9) Zoranzoki21: Add two throttle rules and remove outdated rule [mediawiki-config] - https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408)
[23:43:20] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29,
2016) timed out before a response was received
[23:43:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:45:21] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[23:46:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:46:00] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:46:28] (CR) jenkins-bot: Set $wgPropagateErrors to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/423338 (https://phabricator.wikimedia.org/T45086) (owner: Gergő Tisza)
[23:53:40] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[23:53:40] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[23:57:16] (PS1) Paladox: Planet: Redesgn UI [puppet] - https://gerrit.wikimedia.org/r/467100
[23:57:55] (CR) jerkins-bot: [V: -1] Planet: Redesgn UI [puppet] - https://gerrit.wikimedia.org/r/467100 (owner: Paladox)
[23:59:11] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[23:59:11] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received