[00:05:59] <wikibugs>	 (PS3) Dzahn: base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782)
[00:08:27] <wikibugs>	 (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12880/planet1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:09:11] <wikibugs>	 (CR) Dzahn: "per the suggestion above. but yea, it still has the different format as mentioned by Alex" [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:10:33] <wikibugs>	 (CR) Dzahn: [C: -1] "needs more string mangling, ack" [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:13:36] <thifranc>	 Hi, is this the right place for chatting/asking questions about contributing to wikimedia, or is there another channel?
[00:14:28] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 60 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[00:16:43] <mutante>	 thifranc: it's not the worst one :) feel free to ask me.. just about to leave and be back in a little while
[00:17:01] <mutante>	 there is also #wikimedia-tech i guess both can work
[00:17:27] <mutante>	 there are many more but it depends what you want to work on
[00:19:20] <mutante>	 thifranc: https://meta.wikimedia.org/wiki/IRC/Channels#General_development_and_technical_discussion
[00:19:25] <thifranc>	 Well I've only got the basic setup, i.e. gerrit-ldap-phabricator accounts. I'm more into sysadmin, I've read some docs I came across on wikimedia, but I don't really know where to start
[00:20:13] <mutante>	 not wrong here for sysadmin stuff
[00:20:23] <mutante>	 we use puppet for everything
[00:22:30] <mutante>	 thifranc: you would want to start with a user on wikitech.wikimedia.org
[00:22:37] <thifranc>	 I use ansible for work, do you think I should first play with puppet before reading wikimedia code?
[00:22:39] <mutante>	 oh, you already have that i guess
[00:22:57] <mutante>	 yea, you can clone the puppet repo and take a look, it's all public
[00:23:11] <mutante>	 and see https://wikitech.wikimedia.org/wiki/Puppet_coding
[00:23:24] <mutante>	 it starts with how to clone
[00:23:34] <mutante>	 and continues to talk about puppet style
[00:23:45] <thifranc>	 hmm yeah I've read this page about puppet formatting
[00:24:50] <thifranc>	 but on phabricator, are sysadmin-related tickets tagged specifically?
[00:25:23] <mutante>	 thifranc: yea, a common one is https://phabricator.wikimedia.org/tag/operations/
[00:25:54] <thifranc>	 and is there any resource explaining the basics of the wikimedia infrastructure?
[00:27:38] <mutante>	 in general, the wikitech wiki as a whole
[00:27:42] <mutante>	 there is also https://en.wikipedia.org/wiki/Wikimedia_Foundation#Technology
[00:28:08] <mutante>	 https://wikitech.wikimedia.org/wiki/File:Infrastructure_overview.png
[00:32:38] <thifranc>	 awesome! thanks for those diagrams, that's what I was looking for
[02:00:08] <icinga-wm>	 PROBLEM - Filesystem available is greater than filesystem size on ms-be1043 is CRITICAL: cluster=swift device=/dev/sdh1 fstype=xfs instance=ms-be1043:9100 job=node mountpoint=/srv/swift-storage/sdh1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops
[02:05:07] <wikibugs>	 (PS1) Andrew Bogott: Make cloudvirt1023 a compute node [puppet] - https://gerrit.wikimedia.org/r/466817 (https://phabricator.wikimedia.org/T199125)
[02:09:14] <wikibugs>	 (CR) Andrew Bogott: [C: 2] Make cloudvirt1023 a compute node [puppet] - https://gerrit.wikimedia.org/r/466817 (https://phabricator.wikimedia.org/T199125) (owner: Andrew Bogott)
[02:13:47] <wikibugs>	 (PS1) Andrew Bogott: Define profile::openstack::base::nova::instance_dev for cloudvirt1023 [puppet] - https://gerrit.wikimedia.org/r/466818
[02:15:04] <wikibugs>	 (CR) Andrew Bogott: [C: 2] Define profile::openstack::base::nova::instance_dev for cloudvirt1023 [puppet] - https://gerrit.wikimedia.org/r/466818 (owner: Andrew Bogott)
[02:31:48] <icinga-wm>	 PROBLEM - puppet last run on cloudvirt1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:36:24] <wikibugs>	 Operations, DBA, MediaWiki-Database, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle) > The MediaWiki eqiad-appserver cluster **gasping for air**,  | {F26543386 height=300} | //[figure 1.](https://grafana.wikimedia.org/dash...
[02:37:29] <wikibugs>	 Operations, DBA, MediaWiki-Database, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle)
[02:46:48] <icinga-wm>	 RECOVERY - puppet last run on cloudvirt1023 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[03:06:03] <wikibugs>	 Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Smalyshev) > If update lag is not a big issue for our users, then we should make it clear...
[03:16:56] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:17:58] <wikibugs>	 (PS2) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[03:24:07] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:30:16] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 930.40 seconds
[03:53:07] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 226.74 seconds
[04:06:41] <wikibugs>	 Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (Smalyshev) > Our production instance consume ~3-4K IOPS just for updates.  To clarify here, this would just...
[04:12:50] <wikibugs>	 Operations, Performance-Team, HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786 (BPirkle)
[04:53:17] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured im
[04:53:17] <icinga-wm>	 29, 2016 returned the unexpected status 404 (expecting: 200)
[04:54:27] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[05:08:27] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200)
[05:09:36] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:23:07] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1173 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[05:46:17] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:53:37] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:54:13] <moritzm>	 !log installing git security updates on trusty
[05:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:46] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:59:00] <wikibugs>	 (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:05:04] <wikibugs>	 (CR) Muehlenhoff: [C: 1] "Looks good, PCC is also fine: https://puppet-compiler.wmflabs.org/compiler1002/12881/" [puppet] - https://gerrit.wikimedia.org/r/466696 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:05:57] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:27:43] <wikibugs>	 (PS3) Muehlenhoff: Cleanup systemd state on Diamond removal [puppet] - https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454)
[06:29:06] <icinga-wm>	 PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/systemd/system/nginx.service.d/security.conf]
[06:32:28] <icinga-wm>	 PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:49:27] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Cleanup systemd state on Diamond removal [puppet] - https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[06:57:58] <icinga-wm>	 RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:27] <icinga-wm>	 RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:04] <wikibugs>	 (PS1) Muehlenhoff: Remove diamonn on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454)
[07:00:40] <wikibugs>	 (PS2) Muehlenhoff: Remove diamond on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454)
[07:02:38] <wikibugs>	 (PS3) Muehlenhoff: Remove diamond on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454)
[07:04:44] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove diamond on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[07:09:08] <wikibugs>	 (CR) KartikMistry: [C: -1] "To be uploaded only with new apertium package." [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/465932 (https://phabricator.wikimedia.org/T206439) (owner: KartikMistry)
[07:10:49] <wikibugs>	 (PS3) Muehlenhoff: Remove Diamond from Parsoid [puppet] - https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454)
[07:12:33] <elukey>	 gilles: o/
[07:12:43] <wikibugs>	 (PS4) Muehlenhoff: Remove Diamond from Parsoid [puppet] - https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454)
[07:12:45] <elukey>	 do you know if we plot somewhere thumbor's memcache stats?
[07:13:21] <gilles>	 elukey: I'm not sure
[07:13:28] <gilles>	 we might not
[07:14:01] <elukey>	 because I am upgrading the prometheus exporter and I was wondering where to check metrics
[07:14:07] <elukey>	 maybe we collect only mc*
[07:14:38] <elukey>	 yeah seems so
[07:19:02] <wikibugs>	 (PS1) Elukey: role::prometheus::ops: collect memcached stats from thumbor/swift [puppet] - https://gerrit.wikimedia.org/r/466828
[07:19:42] <wikibugs>	 Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (MoritzMuehlenhoff) >>! In T183454#4657874, @MoritzMuehlenhoff wrote: > Instead we can simply workaround this in puppet: https://gerrit.wikimedia....
[07:20:13] <wikibugs>	 (PS2) Muehlenhoff: Add myself to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/465413
[07:23:10] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Add myself to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/465413 (owner: Muehlenhoff)
[07:26:23] <wikibugs>	 (CR) Elukey: "So https://puppet-compiler.wmflabs.org/compiler1002/12882/ looks good but I am not sure if the new labels for the memcached cluster implie" [puppet] - https://gerrit.wikimedia.org/r/466828 (owner: Elukey)
[07:37:17] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:37:54] <wikibugs>	 (PS1) Muehlenhoff: Remove sarin/neodymium from network constants/tcpircbot config [puppet] - https://gerrit.wikimedia.org/r/466830
[07:39:14] <wikibugs>	 (CR) Fomafix: "Since d59f27aeab08b171e5ab6a081e763a4cad0bca04 the variants sr-cyrl and sr-latn are already supported." [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[07:40:58] <wikibugs>	 (PS1) Muehlenhoff: Remove sarin/neodymium from grant/mysql root hosts [puppet] - https://gerrit.wikimedia.org/r/466833
[07:43:16] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 3656 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:44:36] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:06:26] <wikibugs>	 Operations, cloud-services-team: Create a jessie netboot image with the 4.9 Linux kernel - https://phabricator.wikimedia.org/T206761 (MoritzMuehlenhoff) The current set of cloudvirt* installed fine with the stock jessie 3.16 kernel as the 10G NIC is disabled, so this is not needed immediately. We can sti...
[08:11:28] <wikibugs>	 Operations, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (elukey) Open>Resolved
[08:11:52] <wikibugs>	 Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (elukey) Open>Resolved
[08:22:57] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 48.85 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:25:35] <wikibugs>	 (PS2) Elukey: admin: add turnilo and superset sudo privs to analytics-admin group [puppet] - https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: Herron)
[08:28:27] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.22 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:33:56] <wikibugs>	 Operations, DBA, MediaWiki-Database, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek)  The binlog purging on codfw was started yesterday (sorry I didn't logged it here), and it runs since; the replication works, and the disk...
[08:48:34] <wikibugs>	 Operations, monitoring, Discovery-Search (Current work), Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (Gehel) The ones I have seen are relatively short burst of errors in error counters in interfaces on WDQS (`node_network_receive...
[08:52:37] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:53:46] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.008 second response time
[08:58:13] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (MoritzMuehlenhoff) When you're initiating an install, do you see a GET for autoinstall/preseed.cfg showing up in ng...
[09:01:45] <elukey>	 !log rolling restart of eventbus on kafka[1,2]00[1-3] to pick up python security upgrades
[09:01:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:51] <elukey>	 Cc: mobrovac --^
[09:05:59] <wikibugs>	 Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:17:26] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for J
[09:17:27] <icinga-wm>	 h aggregated=true) returned the unexpected status 504 (expecting: 200)
[09:18:07] <wikibugs>	 (PS4) Muehlenhoff: Remove Diamond from Hadoop systems [puppet] - https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454)
[09:18:27] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:18:53] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Remove Diamond from Hadoop systems [puppet] - https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[09:26:17] <wikibugs>	 Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) First, the timeline:  - Internal traffic starts flowing through eqiad in the interval 14:14:44 - 14:15:03 - External traffic...
[09:32:56] <icinga-wm>	 PROBLEM - HHVM rendering on mw1342 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[09:32:57] <icinga-wm>	 PROBLEM - Apache HTTP on mw1342 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.005 second response time
[09:33:57] <icinga-wm>	 RECOVERY - HHVM rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 81961 bytes in 0.143 second response time
[09:34:06] <icinga-wm>	 RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time
[09:47:15] <wikibugs>	 Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (akosiaris)
[09:49:01] <wikibugs>	 Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiasco post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:51:19] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[09:51:36] <wikibugs>	 Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:52:07] <wikibugs>	 (PS2) Muehlenhoff: Enable base::service_auto_restart for uwsgi-coal [puppet] - https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991)
[09:54:16] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for uwsgi-coal [puppet] - https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:56:08] <wikibugs>	 (CR) Volans: [C: 2] Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[09:57:30] <wikibugs>	 (Merged) jenkins-bot: Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[09:58:03] <addshore>	 jouncebot: next
[09:58:03] <jouncebot>	 In 72 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1030)
[09:58:08] <addshore>	 oh, it's friday duh
[09:58:52] * addshore is going to backport https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/466843/ as that version of the eventlogging schema is needed for something in 2 hours
[10:01:40] <wikibugs>	 (PS3) Volans: cookbook: split main() into parse_args() and run() [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079)
[10:01:46] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:02:13] <wikibugs>	 (CR) Volans: "Re-open for final review and ready to be merged" [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:03:17] <wikibugs>	 (PS3) Volans: etcd-config: add check for directory [puppet] - https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413)
[10:04:26] <wikibugs>	 (CR) Volans: [C: 2] etcd-config: add check for directory [puppet] - https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413) (owner: Volans)
[10:05:07] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for
[10:05:07] <icinga-wm>	 turned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections
[10:05:07] <icinga-wm>	 ected status 504 (expecting: 200)
[10:06:16] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:06:54] <wikibugs>	 (CR) Alexandros Kosiaris: [C: 2] ores: Add logstash config [puppet] - https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: Ladsgroup)
[10:07:02] <wikibugs>	 (PS4) Alexandros Kosiaris: ores: Add logstash config [puppet] - https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: Ladsgroup)
[10:07:05] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] ores: Add logstash config [puppet] - https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: Ladsgroup)
[10:08:07] <wikibugs>	 (CR) Mark Bergsma: cookbook: split main() into parse_args() and run() (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:08:41] <logmsgbot>	 !log addshore@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/WikimediaEvents/extension.json: T205283 [[gerrit:466843]] Update Schema:WMDEBannerEvents rev to 18437830 (duration: 00m 52s)
[10:08:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:44] <stashbot>	 T205283: Measure usage of banner slider functionality - https://phabricator.wikimedia.org/T205283
[10:09:07] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:09:37] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: 
[10:09:37] <icinga-wm>	 ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:09:37] <icinga-wm>	 -html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:10:47] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:12:42] <wikibugs>	 (PS1) Banyek: mariadb: productionize db2096 [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593)
[10:13:57] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: 
[10:13:57] <icinga-wm>	 ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:13:57] <icinga-wm>	 -html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:15:07] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:17:37] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200)
[10:18:37] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:21:00] <wikibugs>	 (PS1) Banyek: mariadb: produtionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[10:21:58] <wikibugs>	 (PS2) Banyek: mariadb: productionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[10:22:39] <wikibugs>	 (CR) Muehlenhoff: "Picking up various comments together:" [puppet] - https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:22:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: productionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[10:23:17] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 201
[10:23:17] <icinga-wm>	 xpected value at path = Missing keys: [umostread, uimage]
[10:24:12] <wikibugs>	 (PS3) Banyek: mariadb: productionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[10:24:26] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:27:39] <hoo>	 !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds
[10:27:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:42] <stashbot>	 T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[10:31:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: 
[10:31:27] <icinga-wm>	 ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected st
[10:31:27] <icinga-wm>	 : 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016
[10:31:54] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] "Missing the IP entries" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek)
[10:32:28] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] "Missing the host yaml, the shard, the binlog format." [puppet] - 10https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek)
[10:33:30] <wikibugs>	 (03CR) 10Banyek: "ok, I'll check them!" [puppet] - 10https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: 10Banyek)
[10:34:07] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the 
[10:34:07] <icinga-wm>	 for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200)
[10:34:59] <_joe_>	 504 means restbase timed out before getting a response I guess?
[10:35:17] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:35:37] <_joe_>	 it's probably just scb1004
[10:36:23] <elukey>	 I was about to ask the same
[10:37:29] <_joe_>	 all endpoints healthy now 
[10:37:56] <_joe_>	 it's probably some slowness in the API then, seeing restbase had the same issue
[10:38:07] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:39:09] <_joe_>	 yes https://grafana.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen&orgId=1&from=now-1h&to=now
[10:51:06] <_joe_>	 !log depooling mw2252 for mcrouter tests T203786
[10:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:10] <stashbot>	 T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786
[10:52:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for nginx on debug proxies [puppet] - 10https://gerrit.wikimedia.org/r/466852 (https://phabricator.wikimedia.org/T135991)
[10:56:08] <wikibugs>	 (03PS2) 10Banyek: mariadb: productionize db2096 [puppet] - 10https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593)
[11:10:55] <wikibugs>	 10Operations, 10Traffic: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) Also remembering there's some stats @Krinkle mentioned in T196248 about clock skews and users from some Google research.  The TL;DR there was 24 hours gives us 93.3% , and 5 days is the sweet spot g...
[11:15:16] <wikibugs>	 10Operations, 10Traffic: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack)
[11:18:28] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991)
[11:19:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[11:24:41] <wikibugs>	 (03PS1) 10Banyek: mariadb: reimage db2096 [puppet] - 10https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593)
[11:27:11] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991)
[11:29:51] <wikibugs>	 (03PS4) 10Banyek: mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[11:31:40] <wikibugs>	 (03PS3) 10Muehlenhoff: rsyncd: Add option to generate ferm rules based on $hosts_allow [puppet] - 10https://gerrit.wikimedia.org/r/465378
[11:55:50] <bblack>	 jouncebot: next
[11:55:50] <jouncebot>	 In 70 hour(s) and 34 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1030)
[11:56:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] rsyncd: Add option to generate ferm rules based on $hosts_allow [puppet] - 10https://gerrit.wikimedia.org/r/465378 (owner: 10Muehlenhoff)
[11:56:53] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, and 3 others: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630 (10Ladsgroup) It works: https://logstash-beta.wmflabs.org/goto/f9a3fea8c95c02724813fb7cbebb6471 Shall we go prod? let's do it on Monday.
[12:06:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch yhsm_aead_sync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/466867
[12:19:04] <wikibugs>	 (03PS2) 10Elukey: Refactor type Systemd::Timer::DateTime to include more normal forms [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532)
[12:19:30] <wikibugs>	 (03CR) 10Elukey: "any thought? @BStorm? :)" [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[12:20:33] <bblack>	 !log uploading gdnsd 2.99.9942-beta-1+wmf1 to stretch-wikimedia
[12:20:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:24] <wikibugs>	 (03PS2) 10Zoranzoki21: Add throttle rule and remove outdated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408)
[12:36:27] <icinga-wm>	 RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms
[12:43:36] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[12:44:05] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Jgreen)      >>! In T188561#4656191, @CCogdill_WMF wrote: > @Jgreen @BBlack thanks for bumping this and continuing to check. It does look like the SS...
[12:52:13] <wikibugs>	 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10elukey) a:05elukey>03None
[12:54:20] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) Linux 4.9.130-1 (which also contains the backport of the H840 Perc controller I made) has now been uploaded to "stretch-proposed-updates", the staging directory for p...
[12:54:29] <wikibugs>	 (03PS2) 10Elukey: profile::statistics::cruncher|private: remove unused bacula settings [puppet] - 10https://gerrit.wikimedia.org/r/454480 (https://phabricator.wikimedia.org/T201165)
[12:55:34] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::statistics::cruncher|private: remove unused bacula settings [puppet] - 10https://gerrit.wikimedia.org/r/454480 (https://phabricator.wikimedia.org/T201165) (owner: 10Elukey)
[12:55:57] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[12:59:42] <gehel>	 !log depooling wdqs1003 to catch up on lag
[12:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:46] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1048 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[13:12:50] <wikibugs>	 (03PS1) 10Jgreen: remove khorn from icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/466875
[13:13:46] <wikibugs>	 (03CR) 10Jgreen: [C: 032] remove khorn from icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/466875 (owner: 10Jgreen)
[13:15:56] <wikibugs>	 (03CR) 10Jgreen: [V: 032 C: 032] remove khorn from icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/466875 (owner: 10Jgreen)
[13:31:12] <wikibugs>	 (03PS1) 10Volans: cumin: enable known hosts backend in prod [puppet] - 10https://gerrit.wikimedia.org/r/466879 (https://phabricator.wikimedia.org/T206844)
[13:35:42] <gehel>	 !log repooling wdqs1003, caught up on lag
[13:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for squid/url downloaders [puppet] - 10https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991)
[13:38:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for squid/url downloaders [puppet] - 10https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:43:14] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for squid/url downloaders [puppet] - 10https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991)
[13:45:00] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mcrouter: allow defining a non-default number of backend connectors [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786)
[13:48:18] <wikibugs>	 (03CR) 10Volans: "I've also updated the wikitech page:" [puppet] - 10https://gerrit.wikimedia.org/r/466879 (https://phabricator.wikimedia.org/T206844) (owner: 10Volans)
[13:54:12] <wikibugs>	 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) So first, what I think might be the full root cause of everything: When we switched from codfw to eqiad the parser cach...
[13:55:23] <wikibugs>	 (03PS1) 10Banyek: wmf-pt-kill: logrotate feature added [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521)
[14:16:10] <icinga-wm>	 PROBLEM - Juniper alarms on cr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[14:16:20] <icinga-wm>	 PROBLEM - Host lvs5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:20] <icinga-wm>	 PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:21] <icinga-wm>	 PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:21] <icinga-wm>	 PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:21] <icinga-wm>	 PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:21] <icinga-wm>	 PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:21] <icinga-wm>	 PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:22] <icinga-wm>	 PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:22] <icinga-wm>	 PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[14:16:52] <moritzm>	 XioNoX: ^ known issue?
[14:17:10] <icinga-wm>	 PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:10] <icinga-wm>	 PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:10] <icinga-wm>	 PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:10] <icinga-wm>	 PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:10] <icinga-wm>	 PROBLEM - Host lvs5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:11] <icinga-wm>	 PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:13] <XioNoX>	 wtf
[14:17:19] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:17:21] <XioNoX>	 seems like we lost our mgmt router?
[14:17:46] <_joe_>	 oh, wow
[14:18:05] <XioNoX>	 we lost one power feed
[14:18:09] <XioNoX>	 one PDU
[14:18:17] <XioNoX>	 we should depool eqsin
[14:18:17] <_joe_>	 to the whole rack?
[14:18:21] <_joe_>	 yes
[14:18:24] <_joe_>	 +1
[14:18:36] <XioNoX>	 2018-10-12 14:13:57 UTC  Major  PEM 0 Not OK on cr1-eqsin
[14:18:39] <icinga-wm>	 PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:18:55] <_joe_>	 yeah let's depool quickly
[14:19:09] <XioNoX>	 so no service interruption as far as I know but no power redundancy
[14:19:33] <XioNoX>	 _joe_: you're on it or should I?
[14:19:38] <_joe_>	 doing
[14:19:44] <XioNoX>	 cool, thx
[14:20:08] <XioNoX>	 the mgmt router only has 1 power supply
[14:20:10] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2)
[14:20:29] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890
[14:20:35] <_joe_>	 XioNoX: ^^
[14:20:45] <XioNoX>	 yeah, that's expected
[14:20:49] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:49] <icinga-wm>	 PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:49] <icinga-wm>	 PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:49] <icinga-wm>	 PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:49] <icinga-wm>	 PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:50] <icinga-wm>	 PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:20:59] <_joe_>	 no I mean my patch :P
[14:21:01] <XioNoX>	 mr1-eqsin only has 1 power supply so it went down
[14:21:05] <XioNoX>	 ah
[14:21:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890 (owner: 10Giuseppe Lavagetto)
[14:21:39] <XioNoX>	 _joe_: don't see the patch
[14:21:45] <_joe_>	 https://gerrit.wikimedia.org/r/466890
[14:21:59] <XioNoX>	 thx
[14:22:20] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890 (owner: 10Giuseppe Lavagetto)
[14:22:26] <wikibugs>	 (03CR) 10Ayounsi: [C: 031] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890 (owner: 10Giuseppe Lavagetto)
[14:22:32] <XioNoX>	 _joe_: lgtm
[14:23:08] <_joe_>	 !log depooling eqsin via geodns due to loss of power redundancy
[14:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:21] <_joe_>	 {{done}}
[14:23:22] <XioNoX>	 _joe_: from the maint-announce list: "REMINDER - (SERVICE IMPACTING) Scheduled Customer Outage Level 6 UDU-6-EPR A2-01 Switchboard Modification at the SG3 IBX [5-167414185705]"
[14:23:34] <_joe_>	 scheduled?
[14:23:37] <_joe_>	 nice
[14:23:46] <_joe_>	 how did we not notice?
[14:24:05] <XioNoX>	 it's not in the calendar either
[14:24:13] <_joe_>	 yeah
[14:24:20] <_joe_>	 when did that email arrive?
[14:24:33] <XioNoX>	 first one was on sept. 7
[14:24:38] <_joe_>	 nice
[14:24:45] <XioNoX>	 2nd one 1h20min ago
[14:24:49] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: 
[14:24:49] <icinga-wm>	 ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{dom
[14:24:49] <icinga-wm>	 -html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{dom
[14:25:09] <XioNoX>	 I'll open a task
[14:25:21] <XioNoX>	 UTC:	FRIDAY, 12 OCT 14:00 - SATURDAY, 13 OCT 00:00
[14:25:54] <XioNoX>	 moritzm, _joe_ thanks for the quick notification/action
[14:27:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[14:27:26] <_joe_>	 XioNoX: so I think that was the week when the person on clinic duty just forgot
[14:29:28] <wikibugs>	 (03CR) 10Imarlier: [C: 031] Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[14:30:09] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.64 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:32:26] <wikibugs>	 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10ayounsi) p:05Triage>03Normal
[14:32:40] <XioNoX>	 _joe_, moritzm, opened https://phabricator.wikimedia.org/T206861
[14:33:03] <wikibugs>	 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10ayounsi)
[14:33:54] <wikibugs>	 (03PS1) 10Jgreen: swap frauth2001 in for betelguese in nsca_frack_cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/466894
[14:34:01] <XioNoX>	 another question is why Equinix does service impacting power work at 10pm Friday local time...
[14:34:50] <wikibugs>	 (03CR) 10Jgreen: [C: 032] swap frauth2001 in for betelguese in nsca_frack_cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/466894 (owner: 10Jgreen)
[14:35:19] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:36:44] <icinga-wm>	 ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 12.62 le 60 Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:36:49] <herron>	 can we place the mgmt switch and other single psu gear there on a pdu with automatic transfer switch?
[14:37:29] <icinga-wm>	 ACKNOWLEDGEMENT - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861
[14:37:29] <icinga-wm>	 ACKNOWLEDGEMENT - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861
[14:37:29] <icinga-wm>	 ACKNOWLEDGEMENT - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861
[14:37:29] <icinga-wm>	 ACKNOWLEDGEMENT - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861
[14:37:29] <icinga-wm>	 ACKNOWLEDGEMENT - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2) Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861
[14:38:00] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:38:40] <icinga-wm>	 ACKNOWLEDGEMENT - Juniper alarms on cr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[14:38:40] <icinga-wm>	 ACKNOWLEDGEMENT - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:40:19] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:40:38] <XioNoX>	 herron: it's an option, but it also means more expensive and complex PDUs that can become the SPOF
[14:40:41] <bblack>	 long ago
[14:40:49] <bblack>	 oops, ignore me
[14:41:25] <wikibugs>	 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) With memcached wiped clean, and the parsercache databases basically void of useful content, almost all requests needed...
[14:41:49] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dns5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:43:29] <icinga-wm>	 PROBLEM - IPMI Sensor Status on dns5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:45:19] <wikibugs>	 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) At the same time, a higher time for processing a single request meant that even in front of a substantially constant re...
[14:46:49] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:47:29] <icinga-wm>	 PROBLEM - IPMI Sensor Status on lvs5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:48:29] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:48:29] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:48:36] <bblack>	 lol
[14:48:52] <bblack>	 does redundancy count if it still manages to annoy us when we lose half the power? :)
[14:53:39] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical]
[14:54:40] <mark>	 yeah because I believe at least it hasn't annoyed any of our users? ;_
[14:54:50] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:54:50] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:54:50] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:55:00] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:59:10] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[14:59:40] <icinga-wm>	 PROBLEM - IPMI Sensor Status on bast5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[15:00:22] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]
[15:02:33] <icinga-wm>	 PROBLEM - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical]
[15:05:05] <wikibugs>	 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) As far as MediaWiki fatals go, we had way less issues than one would expect given the graphs above. We had only ~ 1000...
[15:05:56] <wikibugs>	 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) Overall the absence of any valid parsercache entries can explain all the effects we've seen, except at least partially...
[15:14:37] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on bast5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:37] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:37] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:37] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:37] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:37] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] Brandon Black eqsin power maintenance
[15:14:37] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] Brandon Black eqsin power maintenance
[15:14:38] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:38] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:39] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:39] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:14:40] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on cp5011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance
[15:15:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:04] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:05] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:05] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:06] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:06] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:15:07] <icinga-wm>	 ACKNOWLEDGEMENT - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance
[15:17:07] <_joe_>	 de-check "send notifications" next time :P
[15:17:43] <bblack>	 but then there's no irc-visible acknowledgement! :)
[15:33:12] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.07 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[15:38:15] <wikibugs>	 (03CR) 10Cwhite: [C: 031] Remove Diamond from Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[15:39:23] <wikibugs>	 (03PS3) 10Cwhite: nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454)
[15:40:28] <wikibugs>	 (03CR) 10Cwhite: [C: 032] nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[15:48:31] <mutante>	 !log repair /dev/sdh1 on ms-be1043 - T199198
[15:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:35] <stashbot>	 T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[15:49:58] <wikibugs>	 (03PS1) 10Banyek: admin: Change banyek's .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/466901
[15:50:04] <mutante>	 !log repair /dev/sde1 on ms-be2041 - T199198
[15:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:28] <wikibugs>	 (03PS4) 10Dzahn: base/icinga: use monitoring_hosts constant as NRPE allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782)
[15:52:44] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] base/icinga: use monitoring_hosts constant as NRPE allowed_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn)
[15:53:42] <wikibugs>	 (03PS1) 10Cwhite: hiera: remove diamond on deployment_server role [puppet] - 10https://gerrit.wikimedia.org/r/466903 (https://phabricator.wikimedia.org/T183454)
[15:53:44] <wikibugs>	 (03PS1) 10Cwhite: hiera: remove diamond on dumps role [puppet] - 10https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454)
[15:53:46] <wikibugs>	 (03PS1) 10Cwhite: hiera: remove diamond from mediawiki role [puppet] - 10https://gerrit.wikimedia.org/r/466905 (https://phabricator.wikimedia.org/T183454)
[15:53:48] <wikibugs>	 (03PS1) 10Cwhite: hiera: remove diamond from scb role [puppet] - 10https://gerrit.wikimedia.org/r/466906 (https://phabricator.wikimedia.org/T183454)
[15:53:50] <wikibugs>	 (03PS1) 10Cwhite: hiera: remove diamond from thumbor role [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454)
[15:53:52] <wikibugs>	 (03PS1) 10Cwhite: hiera: remove diamond from wmcs role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454)
[15:55:35] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10DStrine) I'm relaying an email from Lisa:  Lisa Gruwell Thu, Oct 11, 6:13 PM (14 hours ago) to Jerry, Caitlin, me...
[15:58:40] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) Hi @jkim_wikimedia   there is one more thing, besides the SSH key, that we will need.  Please go to Wikitech...
[15:58:58] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) p:05Triage>03Normal
[15:59:28] <wikibugs>	 (03Abandoned) 10Cwhite: nutcracker: set diamond::remove on all roles containing nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/464918 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[16:01:03] <icinga-wm>	 RECOVERY - Filesystem available is greater than filesystem size on ms-be1043 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops
[16:02:08] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10Dzahn) Hello @LarsWirzenius   Could you please:   - go to Wikitech wiki and create a user there , direct link: https://wikitech.wikimedia.org/w/index...
[16:02:25] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10Dzahn) p:05Triage>03Normal
[16:03:32] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:10:52] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:12:27] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) @JGreen @cwdent Could you advise how access requests for FRACK are usually handled from here? Do you also ma...
[16:17:03] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200)
[16:18:33] <icinga-wm>	 RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[16:19:13] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:19:22] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for Wikipedia & Education User Group - https://phabricator.wikimedia.org/T206566 (10Dzahn) You have successfully created the mailing list eduwiki and notification has been sent to the list owner dungodung@gmail.com. You can now:  [[ https://lists.w...
[16:20:31] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for Wikipedia & Education User Group - https://phabricator.wikimedia.org/T206566 (10Dzahn) 05Open>03Resolved a:03Dzahn
[16:22:54] <wikibugs>	 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Dzahn) I had sent a mail to list to speed this up because there was no SRE meeting this week. But i have not had any responses, s...
[16:28:23] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.08 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:28:32] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 77 probes of 343 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:29:52] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 77 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:30:33] <wikibugs>	 (03CR) 10Aaron Schulz: mcrouter: allow defining a non-default number of backend connectors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) (owner: 10Giuseppe Lavagetto)
[16:38:33] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 20 probes of 343 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:39:21] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) a:03cwdent
[16:39:53] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:44:09] <wikibugs>	 10Operations, 10monitoring: Setup metrics monitoring for OpenLDAP/corp - https://phabricator.wikimedia.org/T206327 (10Dzahn) p:05Triage>03Normal
[16:47:43] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[16:50:09] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10jkim_wikimedia) Hi @Dzahn ! My user name is 'jkim'. Still working on the SSH key...
[16:50:43] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received
[16:50:56] <wikibugs>	 (03CR) 10Bstorm: "Looks good.  It doesn't break the tests.  There might be comments to update, though. Checking that." [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[16:51:43] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[16:53:53] <wikibugs>	 (03CR) 10Elukey: "We have also discussed what would be the best number of connections to open to each memcached shard for each mcrouter, and something like " [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) (owner: 10Giuseppe Lavagetto)
[16:54:12] <icinga-wm>	 PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200)
[16:54:12] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200)
[16:54:17] <wikibugs>	 (03CR) 10Bstorm: [C: 031] "I may want to add an additional test case to the rspec for this, but I like it.  If you don't feel comfortable with rspec, please make me " [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey)
[16:55:12] <icinga-wm>	 RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy
[16:55:13] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:58:33] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[17:05:02] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.32 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:15:54] <icinga-wm>	 PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:16:23] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 62 probes of 342 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:16:54] <icinga-wm>	 RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.008 second response time
[17:16:55] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.35 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:17:33] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 44 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:21:23] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 10 probes of 342 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:22:42] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 25 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:25:13] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200)
[17:25:15] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T206651 (10Dzahn) p:05Triage>03Normal
[17:25:30] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T206651 (10Dzahn)
[17:26:22] <wikibugs>	 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Convert automation scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T203943 (10Dzahn) p:05Triage>03Normal
[17:26:23] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[17:27:01] <wikibugs>	 10Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (10Dzahn) p:05Triage>03High
[17:27:35] <wikibugs>	 10Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (10Dzahn) a:03Dzahn
[17:27:47] <wikibugs>	 10Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (10Dzahn) p:05Triage>03Normal
[17:28:37] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10Jgreen) 05Open>03Resolved
[17:31:34] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (10Jgreen)
[17:37:10] <icinga-wm>	 RECOVERY - puppet last run on wdqs1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:40:24] <wikibugs>	 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) I have replied by email and referred to this ticket. I also pointed out we don't agree that this check is even useful (but that it was fixed nevertheless)....
[17:41:05] <wikibugs>	 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) 05Open>03Resolved Thanks again. As always it's greatly appreciated that you add all the technical details and background.
[17:41:39] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (10Dzahn) p:05Triage>03Normal
[17:42:14] <wikibugs>	 10Operations, 10ops-codfw, 10fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (10Dzahn) https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Decommissioned
[17:42:17] <wikibugs>	 (03PS1) 10Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782)
[17:43:04] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (10Dzahn)
[17:43:50] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (10Dzahn)
[17:43:59] <wikibugs>	 (03PS2) 10Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - 10https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782)
[17:46:47] <wikibugs>	 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Create a spicerack cookbook to empty a ganeti node from VMs - https://phabricator.wikimedia.org/T203964 (10Dzahn) p:05Triage>03Normal
[17:48:18] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Change digest function of wikimedia-l@ so it send emails only once a day - https://phabricator.wikimedia.org/T141566 (10Dzahn) 05stalled>03declined
[17:51:13] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10cwdent) Hi @jkim_wikimedia - sorry for the confusion, I'll be making this account for you.  Do you have a yubikey?
[17:54:00] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:54:21] <wikibugs>	 (03PS3) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[17:56:10] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:58:23] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (10Dzahn) affected lists as of today:  [[ https://lists.wikimedia.org/mailman/listinfo/affcom-members | Affcom-members ]] [[ https://lists.wikimedia.org/mailman/listinfo/betacluster-a...
[18:02:12] <wikibugs>	 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (10Jgreen)
[18:04:45] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances1-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10RobH) p:05Triage>03High
[18:06:21] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10aborrero)
[18:06:39] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10RobH) So the vlans show:   ``` default-switch          cloud-instances1-b-eqiad 1102...
[18:07:09] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:12:34] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10RobH) So to show the output:   ``` robh@asw2-b-eqiad> show interfaces descriptions | grep cloudvirt1023  ge-1/0/8        up    u...
[18:14:10] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 66 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:15:24] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10Papaul) ```  papaul@asw2-b-eqiad# ...eqiad unit 0 family ethernet-switching vlan members cloud-     {master:2}[edit] papaul@asw2...
[18:20:25] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10RobH) So, the above from @papaul would add in the vlan a second time, since the current config already shows:   ``` interface-ra...
[18:25:26] <addshore>	 !log running modified attachLatest.php script over ~9000 pages on wikidatawiki (with added wait for slaves) T206743
[18:25:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:29] <stashbot>	 T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[18:32:31] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (10Dzahn) Automatic mail to primary list admins of all lists without description sent with:  ``` for list in $(/var/lib/mailman/bin/list_lists | grep "no description"  | sed 's/ - \[n...
[18:34:05] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (10Dzahn) ``` mailing primary admin of Ac-temp to set a description .. mailing primary admin of Advisory to set a description mailing primary admin of Affcom-members to set a descript...
[18:34:15] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10RobH) >>! In T206872#4661995, @Papaul wrote: > ```  > papaul@asw2-b-eqiad# ...eqiad unit 0 family ethernet-switching vlan member...
[18:34:20] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:34:36] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (10Dzahn) p:05Triage>03Low
[18:37:06] <addshore>	 !log modified attachLatest.php script finished running over 9395 pages T206743
[18:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:10] <stashbot>	 T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[18:37:55] <wikibugs>	 (03PS2) 10Dzahn: network::constants: remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465686 (https://phabricator.wikimedia.org/T201343)
[18:41:39] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:50:07] <wikibugs>	 10Operations, 10Core Platform Team Kanban (Watching / External), 10HHVM, 10Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Jdforrester-WMF)
[18:56:09] <brion>	 !log restarted vp9 background transcodes in eqiad, via mwmaint1002
[18:56:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:11] <wikibugs>	 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994 (10Dzahn) bpfilter made it into 4.18 kernel and there are claims that it would "eventually replace both iptables and nftables"
[19:04:06] <wikibugs>	 (03CR) 10Dzahn: [C: 032] network::constants: remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465686 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn)
[19:08:12] <wikibugs>	 (03PS1) 10Dzahn: DHCP: fix mwmaint1001 -> mw1297 fixed address [puppet] - 10https://gerrit.wikimedia.org/r/466947 (https://phabricator.wikimedia.org/T192457)
[19:09:06] <wikibugs>	 (03CR) 10Dzahn: [C: 032] DHCP: fix mwmaint1001 -> mw1297 fixed address [puppet] - 10https://gerrit.wikimedia.org/r/466947 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[19:12:00] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:15:40] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on heze is CRITICAL: cluster=misc device=megaraid,8 instance=heze:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=heze&var-datasource=codfw%2520prometheus%252Fops
[19:19:19] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:32:38] <wikibugs>	 (03PS1) 10Dzahn: nagios_common: add stretch support to check_ssl [puppet] - 10https://gerrit.wikimedia.org/r/466951
[19:37:34] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10faidon)
[19:37:38] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10faidon) 05Open>03Resolved a:03faidon FWIW, this was sorted out for both cloudvirt1023 and cloudvirt1024 in exactly the way...
[19:38:22] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10Andrew) thanks all!
[19:42:39] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10Andrew) Would you expect us to need to reinstall the OS (or otherwise make interface changes) after this change?
[19:43:01] <icinga-wm>	 PROBLEM - Host cloudvirt1019 is DOWN: PING CRITICAL - Packet loss = 100%
[19:43:25] <paladox>	 andrewbogott ^^
[19:43:50] <andrewbogott>	 paladox: I think Chris is still working on that one
[19:43:55] <paladox>	 oh i see
[19:48:59] <cmjohnson1>	 Yes, I’m sorry I did not extend the downtime. I’m waiting on next steps from HP
[19:51:33] <wikibugs>	 (03PS2) 10Jforrester: Install but don't enable the WikibaseMediaInfo extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981)
[19:51:36] <wikibugs>	 (03PS2) 10Jforrester: Install but don't enable the WikibaseMediaInfo extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981)
[19:51:37] <wikibugs>	 (03PS2) 10Jforrester: Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444)
[19:51:39] <wikibugs>	 (03PS2) 10Jforrester: Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981)
[19:51:41] <wikibugs>	 (03PS1) 10Jforrester: Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981)
[19:51:43] <wikibugs>	 (03PS1) 10Jforrester: Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981)
[19:51:45] <wikibugs>	 (03PS1) 10Jforrester: Enable WikibaseMediaInfo on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708)
[19:51:50] <icinga-wm>	 PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2510 MB (5% inode=67%)
[19:52:27] <wikibugs>	 (03CR) 10Jforrester: [C: 04-2] "Blocked on Security Review sign-off." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester)
[19:52:34] <wikibugs>	 (03CR) 10Jforrester: [C: 04-2] "Blocked on Security Review sign-off." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester)
[19:52:46] <wikibugs>	 (03CR) 10Jforrester: [C: 04-2] "Filled with FIXMEs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: 10Jforrester)
[19:52:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: 10Jforrester)
[19:53:06] <wikibugs>	 (03CR) 10Jforrester: "Blocked on Security Review sign-off." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester)
[19:53:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester)
[19:53:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester)
[19:53:32] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester)
[19:53:38] <wikibugs>	 (03Abandoned) 10Jforrester: Enable the WikibaseMediaInfo extension in Beta Cluster Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446845 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester)
[19:53:52] <wikibugs>	 (03CR) 10Jforrester: [C: 04-2] "Absolutely not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708) (owner: 10Jforrester)
[19:53:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Enable WikibaseMediaInfo on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708) (owner: 10Jforrester)
[19:54:00] <wikibugs>	 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10RobH) I wouldn't think so, nope.  The OS install doesn't really make use of eth1...
[19:58:47] <wikibugs>	 10Operations, 10WMF-Blog-Social-Team, 10Wikimedia-Mailing-lists: Request mailman list "worldcup2018" for upcoming affiliate campaign - https://phabricator.wikimedia.org/T196003 (10Aklapper)
[19:58:59] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (10Aklapper) >>! In T179568#4661941, @Dzahn wrote: > https://lists.wikimedia.org/mailman/listinfo/worldcup2018   (How is this even remotely Wikimedia related?)  @dzahn: T196003; maybe...
[19:59:30] <wikibugs>	 (03CR) 10Jforrester: [C: 04-2] Install but don't enable the WikibaseMediaInfo extension, part III (038 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: 10Jforrester)
[20:00:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get)
[20:00:21] <icinga-wm>	 PROBLEM - Nginx local proxy to jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:00:24] <wikibugs>	 (PS2) Dzahn: nagios_common: add stretch support to check_ssl [puppet] - https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782)
[20:01:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[20:01:19] <icinga-wm>	 RECOVERY - Nginx local proxy to jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.077 second response time
[20:02:50] <wikibugs>	 (PS3) Dzahn: icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782)
[20:03:11] <wikibugs>	 (PS3) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444)
[20:03:13] <wikibugs>	 (PS2) Jforrester: Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981)
[20:03:15] <wikibugs>	 (PS2) Jforrester: Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981)
[20:03:17] <wikibugs>	 (PS3) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981)
[20:03:19] <wikibugs>	 (PS2) Jforrester: Enable WikibaseMediaInfo on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708)
[20:03:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:10:00] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:11:31] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (aborrero)
[20:11:39] <wikibugs>	 Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (aborrero) Resolved>Open This still doesn't work. Same behavior as before: packets go out of the interface but nothing en...
[20:13:42] <wikibugs>	 Operations, Traffic: certcentral: check for SCTs, with optional disable per-account - https://phabricator.wikimedia.org/T206876 (BBlack) p:Triage>Normal
[20:14:39] <icinga-wm>	 RECOVERY - Disk space on contint1001 is OK: DISK OK
[20:14:52] <mutante>	 ^ i gzipped a bunch of zuul logs for that 
[20:15:04] <mutante>	 we didn't want jerkins to run out of disk
[20:17:19] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 56 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:29:15] <wikibugs>	 Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (jkim_wikimedia) Hi @cwdent - I do!
[20:32:25] <wikibugs>	 Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Dzahn) @Aklapper thanks. i didn't know the "for affiliate campaign" part of it. That makes it work related.
[20:40:15] <wikibugs>	 (CR) Smalyshev: wdqs: cleanup logback configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[20:40:52] <wikibugs>	 Operations, Wikimedia-General-or-Unknown, Performance: PHP fatal error only in one portal in arwiki - https://phabricator.wikimedia.org/T206878 (Reedy)
[20:41:00] <wikibugs>	 Operations, Wikimedia-General-or-Unknown, Performance: PHP fatal error only in one portal in arwiki - https://phabricator.wikimedia.org/T206878 (alanajjar)
[20:44:10] <icinga-wm>	 RECOVERY - IPMI Sensor Status on dns5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:44:11] <wikibugs>	 Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994 (aborrero) From https://blogs.gnome.org/dcbw/2018/07/27/the-ascendance-of-nftables/ :  ``` What about eBPF?  You might have heard that eBPF will replace everything and give everyone a unicorn. It might, if...
[20:47:39] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:48:10] <icinga-wm>	 RECOVERY - IPMI Sensor Status on lvs5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:48:31] <wikibugs>	 Operations, Wikimedia-General-or-Unknown, Performance: PHP fatal error only in one portal in arwiki - https://phabricator.wikimedia.org/T206878 (alanajjar) >Note: one user said that he can open it, but can't save anything in this page.  Sorry for the bad quality of this image, but this user faced thi...
[20:49:20] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:49:20] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:54:02] <wikibugs>	 Operations, Wikidata, Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (Smalyshev) Open>Resolved a:Smalyshev Does not happen anymore since we're using dailies.
[20:54:30] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:55:40] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:55:40] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:55:40] <wikibugs>	 (CR) Cwhite: nagios_common: add stretch support to check_ssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:56:08] <wikibugs>	 Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (Papaul) I checked the other cloudvirt node that are working (1021,1020 and 1022) all of their eth1 is part of the interface-rang...
[21:00:00] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:03:19] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:06:10] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:10:12] <wikibugs>	 (PS1) Cwhite: phabricator: remove custom diamond::collector [puppet] - https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454)
[21:10:14] <wikibugs>	 (PS1) Cwhite: phabricator: remove diamond::collector and purge diamond [puppet] - https://gerrit.wikimedia.org/r/466989 (https://phabricator.wikimedia.org/T183454)
[21:11:00] <icinga-wm>	 RECOVERY - IPMI Sensor Status on cp5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:11:06] <wikibugs>	 (PS1) Bstorm: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - https://gerrit.wikimedia.org/r/466990
[21:13:52] <wikibugs>	 Operations, Wikimedia-General-or-Unknown, Performance: arwiki page giving "entire web request took longer than 60 seconds and timed out" - https://phabricator.wikimedia.org/T206878 (Reedy)
[21:17:22] <wikibugs>	 (PS2) Bstorm: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - https://gerrit.wikimedia.org/r/466990
[21:29:01] <wikibugs>	 Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (colewhite)
[21:31:43] <wikibugs>	 (PS3) Bstorm: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - https://gerrit.wikimedia.org/r/466990
[21:33:20] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:37:06] <wikibugs>	 (CR) Bstorm: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/466990 (owner: Bstorm)
[21:37:52] <wikibugs>	 (CR) Bstorm: "The puppet compiler seems ok with this approach on the existing jessie servers." [puppet] - https://gerrit.wikimedia.org/r/466990 (owner: Bstorm)
[21:38:08] <wikibugs>	 (CR) Bstorm: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/466990 (owner: Bstorm)
[21:40:40] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 58 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:57:56] <wikibugs>	 (CR) Faidon Liambotis: [C: -1] cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (3 comments) [dns] - https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: Arturo Borrero Gonzalez)
[22:00:53] <wikibugs>	 (CR) Bstorm: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[22:01:07] <wikibugs>	 (PS24) Bstorm: WIP toolforge: write/move a sonofgridengine module and toolforge profile [puppet] - https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557)
[22:06:11] <wikibugs>	 (CR) Gehel: wdqs: cleanup logback configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[22:07:00] <wikibugs>	 (PS8) Gehel: wdqs: cleanup logback configuration [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563)
[22:09:00] <icinga-wm>	 PROBLEM - Memory correctable errors -EDAC- on db1069 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[22:10:47] <wikibugs>	 (CR) Bstorm: [C: 2] WIP toolforge: write/move a sonofgridengine module and toolforge profile [puppet] - https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[22:13:36] <wikibugs>	 (CR) Smalyshev: wdqs: cleanup logback configuration (2 comments) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[22:14:42] <wikibugs>	 (CR) Faidon Liambotis: [C: -1] "I have a proposal for an easier solution: libmonitoring-plugin-perl is available in jessie-backports, so just ensure => installed that on " [puppet] - https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[22:21:07] <wikibugs>	 Operations, Cloud-VPS (Ubuntu Trusty Deprecation): cloudvps: toolserver-legacy project trusty deprecation - https://phabricator.wikimedia.org/T204564 (bd808)
[22:21:26] <wikibugs>	 Puppet, Cloud-VPS (Ubuntu Trusty Deprecation): cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (bd808)
[22:36:21] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:42:38] <wikibugs>	 (PS1) Smalyshev: My tests show that Kafka poller behaves much better with -b 700 [puppet] - https://gerrit.wikimedia.org/r/467002
[22:43:40] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:33:03] <wikibugs>	 Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (faidon)
[23:33:12] <wikibugs>	 Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (faidon) Open>Resolved >>! In T206872#4662465, @Papaul wrote: > I checked the other cloudvirt node that are working (1021...
[23:34:29] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:41:40] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts