[00:05:59] (PS3) Dzahn: base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782)
[00:08:27] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12880/planet1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:09:11] (CR) Dzahn: "per the suggestion above. but yea, it still has the different format as mentioned by Alex" [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:10:33] (CR) Dzahn: [C: -1] "needs more string mangling, ack" [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:13:36] Hi, is this the good place for chatting/asking questions regarding contributing to wikimedia or is there another channel ?
[00:14:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 60 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[00:16:43] thifranc: it's not the worst one :) feel free to ask me..
just about to leave and be back in a little while
[00:17:01] there is also #wikimedia-tech i guess both can work
[00:17:27] there are many more but it depends what you want to work on
[00:19:20] thifranc: https://meta.wikimedia.org/wiki/IRC/Channels#General_development_and_technical_discussion
[00:19:25] Well I've only got the basic setup, with gerrit-ldap-phabricator accounts, I'm more into sysadmin, I've read some docs I've encountered on wikimedia, but I don't really know where to start
[00:20:13] not wrong here for sysadmin stuff
[00:20:23] we use puppet for everything
[00:22:30] thifranc: you would want to start with a user on wikitech.wikimedia.org
[00:22:37] I use ansible for work, do you think I should first play with puppet before reading wikimedia code ?
[00:22:39] oh, you already have that i guess
[00:22:57] yea, you can clone the puppet repo and take a look, it's all public
[00:23:11] and see https://wikitech.wikimedia.org/wiki/Puppet_coding
[00:23:24] it starts with how to clone
[00:23:34] and continues to talk about puppet style
[00:23:45] hmm yeah I've read this page about puppet formatting
[00:24:50] but on phabricator, are tickets sysadmin related tagged specifically ?
[00:25:23] thifranc: yea, a common one is https://phabricator.wikimedia.org/tag/operations/
[00:25:54] and is there any resource explaining like the basics of wikimedia infras ?
[00:27:38] in general, the wikitech wiki as a whole
[00:27:42] there is also https://en.wikipedia.org/wiki/Wikimedia_Foundation#Technology
[00:28:08] https://wikitech.wikimedia.org/wiki/File:Infrastructure_overview.png
[00:32:38] awesome !
thks for those schema, that's what I was looking for
[02:00:08] PROBLEM - Filesystem available is greater than filesystem size on ms-be1043 is CRITICAL: cluster=swift device=/dev/sdh1 fstype=xfs instance=ms-be1043:9100 job=node mountpoint=/srv/swift-storage/sdh1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops
[02:05:07] (PS1) Andrew Bogott: Make cloudvirt1023 a compute node [puppet] - https://gerrit.wikimedia.org/r/466817 (https://phabricator.wikimedia.org/T199125)
[02:09:14] (CR) Andrew Bogott: [C: 2] Make cloudvirt1023 a compute node [puppet] - https://gerrit.wikimedia.org/r/466817 (https://phabricator.wikimedia.org/T199125) (owner: Andrew Bogott)
[02:13:47] (PS1) Andrew Bogott: Define profile::openstack::base::nova::instance_dev for cloudvirt1023 [puppet] - https://gerrit.wikimedia.org/r/466818
[02:15:04] (CR) Andrew Bogott: [C: 2] Define profile::openstack::base::nova::instance_dev for cloudvirt1023 [puppet] - https://gerrit.wikimedia.org/r/466818 (owner: Andrew Bogott)
[02:31:48] PROBLEM - puppet last run on cloudvirt1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:36:24] Operations, DBA, MediaWiki-Database, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle) > The MediaWiki eqiad-appserver cluster **gasping for air**, | {F26543386 height=300} | //[figure 1.](https://grafana.wikimedia.org/dash...
[02:37:29] Operations, DBA, MediaWiki-Database, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle)
[02:46:48] RECOVERY - puppet last run on cloudvirt1023 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[03:06:03] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Smalyshev) > If update lag is not a big issue for our users, then we should make it clear...
[03:16:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:17:58] (PS2) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[03:24:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:30:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 930.40 seconds
[03:53:07] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 226.74 seconds
[04:06:41] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (Smalyshev) > Our production instance consume ~3-4K IOPS just for updates. To clarify here, this would just...
[04:12:50] Operations, Performance-Team, HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786 (BPirkle)
[04:53:17] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured im
[04:53:17] 29, 2016 returned the unexpected status 404 (expecting: 200)
[04:54:27] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[05:08:27] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200)
[05:09:36] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[05:23:07] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1173 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[05:46:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:53:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:54:13] !log installing git security updates on trusty
[05:54:14] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:59:00] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:05:04] (CR) Muehlenhoff: [C: 1] "Looks good, PCC is also fine: https://puppet-compiler.wmflabs.org/compiler1002/12881/" [puppet] - https://gerrit.wikimedia.org/r/466696 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[06:05:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:27:43] (PS3) Muehlenhoff: Cleanup systemd state on Diamond removal [puppet] - https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454)
[06:29:06] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/systemd/system/nginx.service.d/security.conf]
[06:32:28] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:49:27] (CR) Muehlenhoff: [C: 2] Cleanup systemd state on Diamond removal [puppet] - https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[06:57:58] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:27] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:04] (PS1) Muehlenhoff: Remove diamonn on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454)
[07:00:40] (PS2) Muehlenhoff: Remove diamond on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454)
[07:02:38] (PS3) Muehlenhoff: Remove diamond on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454)
[07:04:44] (CR) Muehlenhoff: [C: 2] Remove diamond on webserver_misc_static role [puppet] - https://gerrit.wikimedia.org/r/466825 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[07:09:08] (CR) KartikMistry: [C: -1] "To be uploaded only with new apertium package." [debs/contenttranslation/lttoolbox] - https://gerrit.wikimedia.org/r/465932 (https://phabricator.wikimedia.org/T206439) (owner: KartikMistry)
[07:10:49] (PS3) Muehlenhoff: Remove Diamond from Parsoid [puppet] - https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454)
[07:12:33] gilles: o/
[07:12:43] (PS4) Muehlenhoff: Remove Diamond from Parsoid [puppet] - https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454)
[07:12:45] do you know if we plot somewhere thumbor's memcache stats?
[07:13:21] elukey: I'm not sure
[07:13:28] we might not
[07:14:01] because I am upgrading the prometheus exporter and I was wondering where to check metrics
[07:14:07] maybe we collect only mc*
[07:14:38] yeah seems so
[07:19:02] (PS1) Elukey: role::prometheus::ops: collect memcached stats from thumbor/swift [puppet] - https://gerrit.wikimedia.org/r/466828
[07:19:42] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (MoritzMuehlenhoff) >>! In T183454#4657874, @MoritzMuehlenhoff wrote: > Instead we can simply workaround this in puppet: https://gerrit.wikimedia....
[07:20:13] (PS2) Muehlenhoff: Add myself to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/465413
[07:23:10] (CR) Muehlenhoff: [C: 2] Add myself to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/465413 (owner: Muehlenhoff)
[07:26:23] (CR) Elukey: "So https://puppet-compiler.wmflabs.org/compiler1002/12882/ looks good but I am not sure if the new labels for the memcached cluster implie" [puppet] - https://gerrit.wikimedia.org/r/466828 (owner: Elukey)
[07:37:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:37:54] (PS1) Muehlenhoff: Remove sarin/neodymium from network constants/tcpircbot config [puppet] - https://gerrit.wikimedia.org/r/466830
[07:39:14] (CR) Fomafix: "Since d59f27aeab08b171e5ab6a081e763a4cad0bca04 the variants sr-cyrl and sr-latn are already supported."
[puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[07:40:58] (PS1) Muehlenhoff: Remove sarin/neodymium from grant/mysql root hosts [puppet] - https://gerrit.wikimedia.org/r/466833
[07:43:16] PROBLEM - High lag on wdqs1003 is CRITICAL: 3656 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:44:36] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:06:26] Operations, cloud-services-team: Create a jessie netboot image with the 4.9 Linux kernel - https://phabricator.wikimedia.org/T206761 (MoritzMuehlenhoff) The current set of cloudvirt* installed fine with the stock jessie 3.16 kernel as the 10G NIC is disabled, so this is not needed immediately. We can sti...
[08:11:28] Operations, Analytics, Patch-For-Review, User-Elukey: rack/setup/install an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (elukey) Open>Resolved
[08:11:52] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (elukey) Open>Resolved
[08:22:57] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 48.85 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:25:35] (PS2) Elukey: admin: add turnilo and superset sudo privs to analytics-admin group [puppet] - https://gerrit.wikimedia.org/r/464831 (https://phabricator.wikimedia.org/T206217) (owner: Herron)
[08:28:27] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 71.22
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:33:56] Operations, DBA, MediaWiki-Database, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) The binlog purging on codfw was started yesterday (sorry I didn't logged it here), and it runs since; the replication works, and the disk...
[08:48:34] Operations, monitoring, Discovery-Search (Current work), Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (Gehel) The ones I have seen are relatively short burst of errors in error counters in interfaces on WDQS (`node_network_receive...
[08:52:37] PROBLEM - HHVM jobrunner on mw1303 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:53:46] RECOVERY - HHVM jobrunner on mw1303 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.008 second response time
[08:58:13] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (MoritzMuehlenhoff) When you're initiating an install, do you see a GET for autoinstall/preseed.cfg showing up in ng...
[09:01:45] !log rolling restart of eventbus on kafka[1,2]00[1-3] to pick up python security upgrades
[09:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:51] Cc: mobrovac --^
[09:05:59] Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:17:26] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for J
[09:17:27] h aggregated=true) returned the unexpected status 504 (expecting: 200)
[09:18:07] (PS4) Muehlenhoff: Remove Diamond from Hadoop systems [puppet] - https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454)
[09:18:27] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[09:18:53] (CR) Muehlenhoff: [C: 2] Remove Diamond from Hadoop systems [puppet] - https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[09:26:17] Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) First, the timeline: - Internal traffic starts flowing through eqiad in the interval 14:14:44 - 14:15:03 - External traffic...
[09:32:56] PROBLEM - HHVM rendering on mw1342 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[09:32:57] PROBLEM - Apache HTTP on mw1342 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.005 second response time
[09:33:57] RECOVERY - HHVM rendering on mw1342 is OK: HTTP OK: HTTP/1.1 200 OK - 81961 bytes in 0.143 second response time
[09:34:06] RECOVERY - Apache HTTP on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.028 second response time
[09:47:15] Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (akosiaris)
[09:49:01] Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiasco post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:51:19] (CR) Alexandros Kosiaris: [C: -1] base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[09:51:36] Operations, DBA, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:52:07] (PS2) Muehlenhoff: Enable base::service_auto_restart for uwsgi-coal [puppet] - https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991)
[09:54:16] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for uwsgi-coal [puppet] - https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:56:08] (CR) Volans: [C: 2] Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[09:57:30] (Merged)
jenkins-bot: Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[09:58:03] jouncebot: next
[09:58:03] In 72 hour(s) and 31 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1030)
[09:58:08] oh, its friday duh
[09:58:52] * addshore is going to backport https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/466843/ as that version of the eventlogging schema is needed for something in 2 hours
[10:01:40] (PS3) Volans: cookbook: split main() into parse_args() and run() [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079)
[10:01:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:02:13] (CR) Volans: "Re-open for final review and ready to be merged" [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:03:17] (PS3) Volans: etcd-config: add check for directory [puppet] - https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413)
[10:04:26] (CR) Volans: [C: 2] etcd-config: add check for directory [puppet] - https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413) (owner: Volans)
[10:05:07] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read
articles for
[10:05:07] turned the unexpected status 504 (expecting: 200): /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 404 (expecting: 200): /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections
[10:05:07] ected status 504 (expecting: 200)
[10:06:16] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:06:54] (CR) Alexandros Kosiaris: [C: 2] ores: Add logstash config [puppet] - https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: Ladsgroup)
[10:07:02] (PS4) Alexandros Kosiaris: ores: Add logstash config [puppet] - https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: Ladsgroup)
[10:07:05] (CR) Alexandros Kosiaris: [V: 2 C: 2] ores: Add logstash config [puppet] - https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: Ladsgroup)
[10:08:07] (CR) Mark Bergsma: cookbook: split main() into parse_args() and run() (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/458115 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:08:41] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/WikimediaEvents/extension.json: T205283 [[gerrit:466843]] Update Schema:WMDEBannerEvents rev to 18437830 (duration: 00m 52s)
[10:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:44] T205283: Measure usage of banner slider functionality - https://phabricator.wikimedia.org/T205283
[10:09:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[10:09:37] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL:
[10:09:37] ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:09:37] -html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:10:47] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:12:42] (PS1) Banyek: mariadb: productionize db2096 [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593)
[10:13:57] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL:
[10:13:57] ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test
page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:13:57] -html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{dom
[10:15:07] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:17:37] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200)
[10:18:37] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:21:00] (PS1) Banyek: mariadb: produtionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[10:21:58] (PS2) Banyek: mariadb: productionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[10:22:39] (CR) Muehlenhoff: "Picking up various comments together:" [puppet] - https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[10:22:59] (CR) jerkins-bot: [V: -1] mariadb: productionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[10:23:17] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed
content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 201
[10:23:17] xpected value at path = Missing keys: [umostread, uimage]
[10:24:12] (PS3) Banyek: mariadb: productionize db2096 [mediawiki-config] - https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593)
[10:24:26] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:27:39] !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds
[10:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:42] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[10:31:27] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL:
[10:31:27] ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected st
[10:31:27] : 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016
[10:31:54] (CR) Marostegui: [C: -1] "Missing the IP entries" [mediawiki-config] - https://gerrit.wikimedia.org/r/466847
(https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[10:32:28] (CR) Marostegui: [C: -1] "Missing the host yaml, the shard, the binlog format." [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[10:33:30] (CR) Banyek: "ok, I'll check them!" [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593) (owner: Banyek)
[10:34:07] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the
[10:34:07] for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 504 (expecting: 200)
[10:34:59] <_joe_> 504 means restbase timed out before getting a response I guess?
[10:35:17] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:35:37] <_joe_> it's probably just scb1004
[10:36:23] I was about to ask the same
[10:37:29] <_joe_> all endpoints healthy now
[10:37:56] <_joe_> it's probably some slowness in the API then, seeing restbase had the same issue
[10:38:07] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[10:39:09] <_joe_> yes https://grafana.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen&orgId=1&from=now-1h&to=now
[10:51:06] <_joe_> !log depooling mw2252 for mcrouter tests T203786
[10:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:10] T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786
[10:52:45] (PS1) Muehlenhoff: Enable base::service_auto_restart for nginx on debug proxies [puppet] - https://gerrit.wikimedia.org/r/466852 (https://phabricator.wikimedia.org/T135991)
[10:56:08] (PS2) Banyek: mariadb: productionize db2096 [puppet] - https://gerrit.wikimedia.org/r/466846 (https://phabricator.wikimedia.org/T206593)
[11:10:55] Operations, Traffic: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (BBlack) Also remembering there's some stats @Krinkle mentioned in T196248 about clock skews and users from some Google research. The TL;DR there was 24 hours gives us 93.3% , and 5 days is the sweet spot g...
[11:15:16] 10Operations, 10Traffic: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) [11:18:28] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) [11:19:02] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:24:41] (03PS1) 10Banyek: mariadb: reimage db2096 [puppet] - 10https://gerrit.wikimedia.org/r/466856 (https://phabricator.wikimedia.org/T206593) [11:27:11] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) [11:29:51] (03PS4) 10Banyek: mariadb: productionize db2096 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466847 (https://phabricator.wikimedia.org/T206593) [11:31:40] (03PS3) 10Muehlenhoff: rsyncd: Add option to generate ferm rules based on $hosts_allow [puppet] - 10https://gerrit.wikimedia.org/r/465378 [11:55:50] jouncebot: next [11:55:50] In 70 hour(s) and 34 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181015T1030) [11:56:12] (03CR) 10Muehlenhoff: [C: 032] rsyncd: Add option to generate ferm rules based on $hosts_allow [puppet] - 10https://gerrit.wikimedia.org/r/465378 (owner: 10Muehlenhoff) [11:56:53] 10Operations, 10Wikimedia-Logstash, 10monitoring, 10Patch-For-Review, and 3 others: Send celery and wsgi service logs to logstash - https://phabricator.wikimedia.org/T181630 (10Ladsgroup) It works: https://logstash-beta.wmflabs.org/goto/f9a3fea8c95c02724813fb7cbebb6471 Shall we go prod? let's do it on Monday. 
[12:06:59] (03PS1) 10Muehlenhoff: Switch yhsm_aead_sync to auto_ferm [puppet] - 10https://gerrit.wikimedia.org/r/466867 [12:19:04] (03PS2) 10Elukey: Refactor type Systemd::Timer::DateTime to include more normal forms [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) [12:19:30] (03CR) 10Elukey: "any thought? @BStorm? :)" [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [12:20:33] !log uploading gdnsd 2.99.9942-beta-1+wmf1 to stretch-wikimedia [12:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:24] (03PS2) 10Zoranzoki21: Add throttle rule and remove outdated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408) [12:36:27] RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [12:43:36] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [12:44:05] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Jgreen) >>! In T188561#4656191, @CCogdill_WMF wrote: > @Jgreen @BBlack thanks for bumping this and continuing to check. It does look like the SS... [12:52:13] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10elukey) a:05elukey>03None [12:54:20] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 (10MoritzMuehlenhoff) Linux 4.9.130-1 (which also contains the backport of the H840 Perc controller I made) has now been uploaded to "stretch-proposed-updates", the staging directory for p... 
[12:54:29] (03PS2) 10Elukey: profile::statistics::cruncher|private: remove unused bacula settings [puppet] - 10https://gerrit.wikimedia.org/r/454480 (https://phabricator.wikimedia.org/T201165) [12:55:34] (03CR) 10Elukey: [C: 032] profile::statistics::cruncher|private: remove unused bacula settings [puppet] - 10https://gerrit.wikimedia.org/r/454480 (https://phabricator.wikimedia.org/T201165) (owner: 10Elukey) [12:55:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [12:59:42] !log depooling wdqs1003 to catch up on lag [12:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:46] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1048 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:12:50] (03PS1) 10Jgreen: remove khorn from icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/466875 [13:13:46] (03CR) 10Jgreen: [C: 032] remove khorn from icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/466875 (owner: 10Jgreen) [13:15:56] (03CR) 10Jgreen: [V: 032 C: 032] remove khorn from icinga contact group [puppet] - 10https://gerrit.wikimedia.org/r/466875 (owner: 10Jgreen) [13:31:12] (03PS1) 10Volans: cumin: enable known hosts backend in prod [puppet] - 10https://gerrit.wikimedia.org/r/466879 (https://phabricator.wikimedia.org/T206844) [13:35:42] !log repooling wdqs1003 catched up on lag [13:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:21] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for squid/url downloaders [puppet] - 10https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991) [13:38:00] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for squid/url downloaders [puppet] - 
10https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:43:14] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for squid/url downloaders [puppet] - 10https://gerrit.wikimedia.org/r/466880 (https://phabricator.wikimedia.org/T135991) [13:45:00] (03PS1) 10Giuseppe Lavagetto: mcrouter: allow defining a non-default number of backend connectors [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) [13:48:18] (03CR) 10Volans: "I've also updated the wikitech page:" [puppet] - 10https://gerrit.wikimedia.org/r/466879 (https://phabricator.wikimedia.org/T206844) (owner: 10Volans) [13:54:12] 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) So first, what I think might be the full root cause of everything: When we switched from codfw to eqiad the parser cach... 
[13:55:23] (03PS1) 10Banyek: wmf-pt-kill: logrotate feature added [debs/wmf-pt-kill] - 10https://gerrit.wikimedia.org/r/466886 (https://phabricator.wikimedia.org/T206521) [14:16:10] PROBLEM - Juniper alarms on cr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:16:20] PROBLEM - Host lvs5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:20] PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:21] PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:21] PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [14:16:21] PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:21] PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:21] PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:22] PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:16:22] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [14:16:52] XioNoX: ^ known issue? [14:17:10] PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:17:10] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:17:10] PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:17:10] PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:17:10] PROBLEM - Host lvs5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:17:11] PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:17:13] wtf [14:17:19] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:17:21] seems like we lost our mgmt router? 
[14:17:46] <_joe_> oh, wow [14:18:05] we lost one power feed [14:18:09] one PDU [14:18:17] we should depool eqsin [14:18:17] <_joe_> to the whole rack? [14:18:21] <_joe_> yes [14:18:24] <_joe_> +1 [14:18:36] 2018-10-12 14:13:57 UTC Major PEM 0 Not OK on cr1-eqsin [14:18:39] PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:18:55] <_joe_> yeah let's depool quickly [14:19:09] so no service interruption as far as I know but no power redundancy [14:19:33] _joe_: you're on it or should I? [14:19:38] <_joe_> doing [14:19:44] cool, thx [14:20:08] the mgmt router only have 1 power supply [14:20:10] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2) [14:20:29] (03PS1) 10Giuseppe Lavagetto: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890 [14:20:35] <_joe_> XioNoX: ^^ [14:20:45] yeah, that's expected [14:20:49] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [14:20:49] PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:20:49] PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:20:49] PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:20:49] PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:20:50] PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:20:59] <_joe_> no I mean my patch :P [14:21:01] mr1-eqsin only have 1 power supply so it went down [14:21:05] ah [14:21:33] (03CR) 10Muehlenhoff: [C: 031] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890 (owner: 10Giuseppe Lavagetto) [14:21:39] _joe_: don't see the patch [14:21:45] <_joe_> https://gerrit.wikimedia.org/r/466890 [14:21:59] thx [14:22:20] (03CR) 10Giuseppe Lavagetto: [C: 032] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890 (owner: 10Giuseppe Lavagetto) [14:22:26] (03CR) 10Ayounsi: [C: 031] Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/466890 (owner: 
10Giuseppe Lavagetto) [14:22:32] _joe_: lgtm [14:23:08] <_joe_> !log depooling eqsin via geodns due to loss of power redundancy [14:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:21] <_joe_> {{done}} [14:23:22] _joe_: from the maint-announce list: "REMINDER - (SERVICE IMPACTING) Scheduled Customer Outage Level 6 UDU-6-EPR A2-01 Switchboard Modification at the SG3 IBX [5-167414185705]" [14:23:34] <_joe_> scheduled? [14:23:37] <_joe_> nice [14:23:46] <_joe_> how did we not notice? [14:24:05] it's not in the calendar neither [14:24:13] <_joe_> yeah [14:24:20] <_joe_> when did that email arrive? [14:24:33] first one was on sept. 7 [14:24:38] <_joe_> nice [14:24:45] 2nd one 1h20min ago [14:24:49] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) is CRITICAL: [14:24:49] ktionary definitions for cat returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200): /{dom [14:24:49] -html/{title}{/revision}{/tid} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200): /{dom [14:25:09] I'll open a task [14:25:21] UTC: FRIDAY, 12 OCT 14:00 - SATURDAY, 13 OCT 00:00 [14:25:54] moritzm, _joe_ thanks for the quick notification/action [14:27:09] RECOVERY - mobileapps 
endpoints health on scb1004 is OK: All endpoints are healthy [14:27:26] <_joe_> XioNoX: so I think that was the week when the person on clinic duty just forgot [14:29:28] (03CR) 10Imarlier: [C: 031] Enable base::service_auto_restart for navtiming [puppet] - 10https://gerrit.wikimedia.org/r/466855 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:30:09] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.64 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:32:26] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10ayounsi) p:05Triage>03Normal [14:32:40] _joe_, moritzm, opened https://phabricator.wikimedia.org/T206861 [14:33:03] 10Operations, 10Traffic: 1 power feed down in eqsin - https://phabricator.wikimedia.org/T206861 (10ayounsi) [14:33:54] (03PS1) 10Jgreen: swap frauth2001 in for betelguese in nsca_frack_cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/466894 [14:34:01] another question is why Equinix does service impacting power work at 10pm Friday local time... [14:34:50] (03CR) 10Jgreen: [C: 032] swap frauth2001 in for betelguese in nsca_frack_cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/466894 (owner: 10Jgreen) [14:35:19] PROBLEM - IPMI Sensor Status on cp5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:36:44] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 12.62 le 60 Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:36:49] can we place the mgmt switch and other single psu gear there on a pdu with automatic transfer switch? 
[14:37:29] ACKNOWLEDGEMENT - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 [14:37:29] ACKNOWLEDGEMENT - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 [14:37:29] ACKNOWLEDGEMENT - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 [14:37:29] ACKNOWLEDGEMENT - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 [14:37:29] ACKNOWLEDGEMENT - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2) Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 [14:38:00] PROBLEM - IPMI Sensor Status on lvs5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:38:40] ACKNOWLEDGEMENT - Juniper alarms on cr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - 4 red alarms, 0 yellow alarms Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:38:40] ACKNOWLEDGEMENT - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 73, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Power outage in eqsin - https://phabricator.wikimedia.org/T206861 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:40:19] PROBLEM - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:40:38] herron: it's an option, but it also mean more expensive and complex PDUs that can become the SPOF [14:40:41] long ago [14:40:49] oops, ignore me [14:41:25] 10Operations, 10DBA, 10Datacenter-Switchover-2018, 
10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) With memcached wiped clean, and the parsercache databases basically void of useful content, almost all requests needed... [14:41:49] PROBLEM - IPMI Sensor Status on dns5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:43:29] PROBLEM - IPMI Sensor Status on dns5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:45:19] 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) At the same time, a higher time for processing a single request meant that even in front of a substantially constant re... [14:46:49] PROBLEM - IPMI Sensor Status on lvs5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:47:29] PROBLEM - IPMI Sensor Status on lvs5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:48:29] PROBLEM - IPMI Sensor Status on cp5012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:48:29] PROBLEM - IPMI Sensor Status on cp5008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:48:36] lol [14:48:52] does redundancy count if it still manages to annoy us when we lose half the power? :) [14:53:39] PROBLEM - IPMI Sensor Status on cp5005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] [14:54:40] yeah because I believe at least it hasn't annoyed any of our users? 
;_ [14:54:50] PROBLEM - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:54:50] PROBLEM - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:54:50] PROBLEM - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:55:00] PROBLEM - IPMI Sensor Status on cp5011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:59:10] PROBLEM - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [14:59:40] PROBLEM - IPMI Sensor Status on bast5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:00:22] PROBLEM - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [15:02:33] PROBLEM - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] [15:05:05] 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) As far as MediaWiki fatals go, we had way less issues than one would expect given the graphs above. We had only ~ 1000... [15:05:56] 10Operations, 10DBA, 10Datacenter-Switchover-2018, 10User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (10Joe) Overall the absence of any valid parsercache entries can explain all the effects we've seen, except at least partially... 
[15:14:37] ACKNOWLEDGEMENT - IPMI Sensor Status on bast5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:37] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:37] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:37] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:37] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:37] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] Brandon Black eqsin power maintenance [15:14:37] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] Brandon Black eqsin power maintenance [15:14:38] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:38] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5008 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:39] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS 
Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:39] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:14:40] ACKNOWLEDGEMENT - IPMI Sensor Status on cp5011 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Brandon Black eqsin power maintenance [15:15:04] ACKNOWLEDGEMENT - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:04] ACKNOWLEDGEMENT - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:04] ACKNOWLEDGEMENT - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:04] ACKNOWLEDGEMENT - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:04] ACKNOWLEDGEMENT - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:04] ACKNOWLEDGEMENT - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:04] ACKNOWLEDGEMENT - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:05] ACKNOWLEDGEMENT - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:05] ACKNOWLEDGEMENT - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:06] ACKNOWLEDGEMENT - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:06] ACKNOWLEDGEMENT - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Brandon Black eqsin power maintenance [15:15:07] ACKNOWLEDGEMENT - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss 
= 100% Brandon Black eqsin power maintenance [15:17:07] <_joe_> de-check "send notifications" next time :P [15:17:43] but then there's no irc-visible acknowledgement! :) [15:33:12] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.07 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:38:15] (03CR) 10Cwhite: [C: 031] Remove Diamond from Parsoid [puppet] - 10https://gerrit.wikimedia.org/r/465441 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:39:23] (03PS3) 10Cwhite: nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454) [15:40:28] (03CR) 10Cwhite: [C: 032] nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [15:48:31] !log repair /dev/sdh1 on ms-be1043 - T199198 [15:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:35] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [15:49:58] (03PS1) 10Banyek: admin: Change banyek's .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/466901 [15:50:04] !log repair /dev/sde1 on ms-be2041 - T199198 [15:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:28] (03PS4) 10Dzahn: base/icinga: use monitoring_hosts constant as NRPE allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) [15:52:44] (03CR) 10Dzahn: [C: 04-1] base/icinga: use monitoring_hosts constant as NRPE allowed_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [15:53:42] (03PS1) 10Cwhite: hiera: remove diamond on deployment_server role [puppet] - 10https://gerrit.wikimedia.org/r/466903 
(https://phabricator.wikimedia.org/T183454) [15:53:44] (03PS1) 10Cwhite: hiera: remove diamond on dumps role [puppet] - 10https://gerrit.wikimedia.org/r/466904 (https://phabricator.wikimedia.org/T183454) [15:53:46] (03PS1) 10Cwhite: hiera: remove diamond from mediawiki role [puppet] - 10https://gerrit.wikimedia.org/r/466905 (https://phabricator.wikimedia.org/T183454) [15:53:48] (03PS1) 10Cwhite: hiera: remove diamond from scb role [puppet] - 10https://gerrit.wikimedia.org/r/466906 (https://phabricator.wikimedia.org/T183454) [15:53:50] (03PS1) 10Cwhite: hiera: remove diamond from thumbor role [puppet] - 10https://gerrit.wikimedia.org/r/466907 (https://phabricator.wikimedia.org/T183454) [15:53:52] (03PS1) 10Cwhite: hiera: remove diamond from wmcs role [puppet] - 10https://gerrit.wikimedia.org/r/466908 (https://phabricator.wikimedia.org/T183454) [15:55:35] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10DStrine) I'm relaying an email from Lisa: Lisa Gruwell Thu, Oct 11, 6:13 PM (14 hours ago) to Jerry, Caitlin, me... [15:58:40] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) Hi @jkim_wikimedia there is one more thing, besides the SSH key, that we will need. Please go to Wikitech... 
[15:58:58] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) p:05Triage>03Normal [15:59:28] (03Abandoned) 10Cwhite: nutcracker: set diamond::remove on all roles containing nutcracker [puppet] - 10https://gerrit.wikimedia.org/r/464918 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [16:01:03] RECOVERY - Filesystem available is greater than filesystem size on ms-be1043 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops [16:02:08] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10Dzahn) Hello @LarsWirzenius Could you please: - go to Wikitech wiki and create a user there , direct link: https://wikitech.wikimedia.org/w/index... 
[16:02:25] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10Dzahn) p:05Triage>03Normal [16:03:32] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:10:52] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:12:27] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) @JGreen @cwdent Could you advise how access requests for FRACK are usually handled from here? Do you also ma... [16:17:03] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200) [16:18:33] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [16:19:13] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [16:19:22] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for Wikipedia & Education User Group - https://phabricator.wikimedia.org/T206566 (10Dzahn) You have successfully created the mailing list eduwiki and notification has been sent to the list owner dungodung@gmail.com. 
You can now: [[ https://lists.w... [16:20:31] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for Wikipedia & Education User Group - https://phabricator.wikimedia.org/T206566 (10Dzahn) 05Open>03Resolved a:03Dzahn [16:22:54] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Allow Analytics team members to restart Turnilo and Superset - https://phabricator.wikimedia.org/T206217 (10Dzahn) I had sent a mail to list to speed this up because there was no SRE meeting this week. But i have not had any responses, s... [16:28:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.08 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:28:32] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 77 probes of 343 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:29:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 77 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:30:33] (03CR) 10Aaron Schulz: mcrouter: allow defining a non-default number of backend connectors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) (owner: 10Giuseppe Lavagetto) [16:38:33] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 20 probes of 343 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [16:39:21] 10Operations, 10Fundraising-Backlog, 10SRE-Access-Requests, 10Wikimedia-Fundraising, 10fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (10Dzahn) a:03cwdent [16:39:53] RECOVERY - IPv6 ping to eqsin on 
ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:44:09] Operations, monitoring: Setup metrics monitoring for OpenLDAP/corp - https://phabricator.wikimedia.org/T206327 (Dzahn) p:Triage>Normal
[16:47:43] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[16:50:09] Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (jkim_wikimedia) Hi @Dzahn ! My user name is 'jkim'. Still working on the SSH key...
[16:50:43] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received
[16:50:56] (CR) Bstorm: "Looks good. It doesn't break the tests. There might be comments to update, though. Checking that."
[puppet] - https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[16:51:43] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[16:53:53] (CR) Elukey: "We have also discussed what would be the best number of connections to open to each memcached shard for each mcrouter, and something like " [puppet] - https://gerrit.wikimedia.org/r/466881 (https://phabricator.wikimedia.org/T203786) (owner: Giuseppe Lavagetto)
[16:54:12] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200)
[16:54:12] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200)
[16:54:17] (CR) Bstorm: [C: 1] "I may want to add an additional test case to the rspec for this, but I like it.
If you don't feel comfortable with rspec, please make me " [puppet] - https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) (owner: Elukey)
[16:55:12] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy
[16:55:13] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy
[16:58:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[17:05:02] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.32 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:15:54] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:16:23] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 62 probes of 342 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:16:54] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.008 second response time
[17:16:55] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.35 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:17:33] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 44 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:21:23] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 10 probes of 342 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:22:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 25 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[17:25:13] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200)
[17:25:15] Operations, ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T206651 (Dzahn) p:Triage>Normal
[17:25:30] Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T206651 (Dzahn)
[17:26:22] Operations, Operations-Software-Development, User-Joe, User-jijiki: Convert automation scripts to spicerack cookbooks - https://phabricator.wikimedia.org/T203943 (Dzahn) p:Triage>Normal
[17:26:23] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[17:27:01] Operations: SRE quarterly goal: Ability to serve a fraction of the production traffic from PHP7 - https://phabricator.wikimedia.org/T206336 (Dzahn) p:Triage>High
[17:27:35] Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (Dzahn) a:Dzahn
[17:27:47] Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (Dzahn) p:Triage>Normal
[17:28:37] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (Jgreen) Open>Resolved
[17:31:34] Operations, ops-codfw, fundraising-tech-ops: decommission betelgeuse -
https://phabricator.wikimedia.org/T206870 (Jgreen)
[17:37:10] RECOVERY - puppet last run on wdqs1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[17:40:24] Operations, Domains, Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (Dzahn) I have replied by email and referred to this ticket. I also pointed out we don't agree that this check is even useful (but that it was fixed nevertheless)....
[17:41:05] Operations, Domains, Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (Dzahn) Open>Resolved Thanks again. As always it's greatly appreciated that you add all the technical details and background.
[17:41:39] Operations, ops-codfw, fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (Dzahn) p:Triage>Normal
[17:42:14] Operations, ops-codfw, fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (Dzahn) https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Decommissioned
[17:42:17] (PS1) Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782)
[17:43:04] Operations, ops-codfw, decommission, fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (Dzahn)
[17:43:50] Operations, ops-codfw, decommission, fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (Dzahn)
[17:43:59] (PS2) Cwhite: nagios_common: set flag -2 on check_nrpe for nrpe on stretch [puppet] - https://gerrit.wikimedia.org/r/466935 (https://phabricator.wikimedia.org/T202782)
[17:46:47] Operations, Operations-Software-Development, User-Joe, User-jijiki: Create a spicerack cookbook to empty a ganeti node from
VMs - https://phabricator.wikimedia.org/T203964 (Dzahn) p:Triage>Normal
[17:48:18] Operations, Wikimedia-Mailing-lists: Change digest function of wikimedia-l@ so it send emails only once a day - https://phabricator.wikimedia.org/T141566 (Dzahn) stalled>declined
[17:51:13] Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (cwdent) Hi @jkim_wikimedia - sorry for the confusion, I'll be making this account for you. Do you have a yubikey?
[17:54:00] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:54:21] (PS3) Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[17:56:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:58:23] Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Dzahn) affected lists as of today: [[ https://lists.wikimedia.org/mailman/listinfo/affcom-members | Affcom-members ]] [[ https://lists.wikimedia.org/mailman/listinfo/betacluster-a...
[18:02:12] Operations, ops-codfw, decommission, fundraising-tech-ops: decommission betelgeuse - https://phabricator.wikimedia.org/T206870 (Jgreen)
[18:04:45] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances1-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (RobH) p:Triage>High
[18:06:21] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (aborrero)
[18:06:39] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (RobH) So the vlans show: ``` default-switch cloud-instances1-b-eqiad 1102...
[18:07:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:12:34] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (RobH) So to show the output: ``` robh@asw2-b-eqiad> show interfaces descriptions | grep cloudvirt1023 ge-1/0/8 up u...
[18:14:10] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 66 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:15:24] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (Papaul) ``` papaul@asw2-b-eqiad# ...eqiad unit 0 family ethernet-switching vlan members cloud- {master:2}[edit] papaul@asw2...
[18:20:25] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (RobH) So, the above from @papaul would add in the vlan a second time, since the current config already shows: ``` interface-ra...
[18:25:26] !log running modified attachLatest.php script over ~9000 pages on wikidatawiki (with added wait for slaves) T206743
[18:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:29] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[18:32:31] Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Dzahn) Automatic mail to primary list admins of all lists without description sent with: ``` for list in $(/var/lib/mailman/bin/list_lists | grep "no description" | sed 's/ - \[n...
[18:34:05] Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Dzahn) ``` mailing primary admin of Ac-temp to set a description .. mailing primary admin of Advisory to set a description mailing primary admin of Affcom-members to set a descript...
[18:34:15] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (RobH) >>! In T206872#4661995, @Papaul wrote: > ``` > papaul@asw2-b-eqiad# ...eqiad unit 0 family ethernet-switching vlan member...
[18:34:20] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:34:36] Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Dzahn) p:Triage>Low
[18:37:06] !log modified attachLatest.php script finished running over 9395 pages T206743
[18:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:10] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[18:37:55] (PS2) Dzahn: network::constants: remove mwmaint1001 [puppet] - https://gerrit.wikimedia.org/r/465686 (https://phabricator.wikimedia.org/T201343)
[18:41:39] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[18:50:07] Operations, Core Platform Team Kanban (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Jdforrester-WMF)
[18:56:09] !log restarted vp9 background transcodes in eqiad, via mwmaint1002
[18:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:11] Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994 (Dzahn) bpfilter made it into 4.18 kernel and there are claims that it would "eventually replace both iptables and nftables"
[19:04:06] (CR) Dzahn: [C: 2] network::constants: remove mwmaint1001 [puppet] - https://gerrit.wikimedia.org/r/465686 (https://phabricator.wikimedia.org/T201343) (owner: Dzahn)
[19:08:12] (PS1)
Dzahn: DHCP: fix mwmaint1001 -> mw1297 fixed address [puppet] - https://gerrit.wikimedia.org/r/466947 (https://phabricator.wikimedia.org/T192457)
[19:09:06] (CR) Dzahn: [C: 2] DHCP: fix mwmaint1001 -> mw1297 fixed address [puppet] - https://gerrit.wikimedia.org/r/466947 (https://phabricator.wikimedia.org/T192457) (owner: Dzahn)
[19:12:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:15:40] PROBLEM - Device not healthy -SMART- on heze is CRITICAL: cluster=misc device=megaraid,8 instance=heze:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=heze&var-datasource=codfw%2520prometheus%252Fops
[19:19:19] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 49 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:32:38] (PS1) Dzahn: nagios_common: add stretch support to check_ssl [puppet] - https://gerrit.wikimedia.org/r/466951
[19:37:34] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (faidon)
[19:37:38] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (faidon) Open>Resolved a:faidon FWIW, this was sorted out for both cloudvirt1023 and cloudvirt1024 in exactly the way...
[19:38:22] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (Andrew) thanks all!
[19:42:39] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (Andrew) Would you expect us to need to reinstall the OS (or otherwise make interface changes) after this change?
[19:43:01] PROBLEM - Host cloudvirt1019 is DOWN: PING CRITICAL - Packet loss = 100%
[19:43:25] andrewbogott ^^
[19:43:50] paladox: I think Chris is still working on that one
[19:43:55] oh i see
[19:48:59] Yes, I’m sorry I did not extend the downtime. I’m waiting on next steps from HP
[19:51:33] (PS2) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part I [mediawiki-config] - https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981)
[19:51:36] (PS2) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part II [mediawiki-config] - https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981)
[19:51:37] (PS2) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444)
[19:51:39] (PS2) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981)
[19:51:41] (PS1) Jforrester: Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981)
[19:51:43] (PS1) Jforrester: Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981)
[19:51:45] (PS1) Jforrester: Enable WikibaseMediaInfo on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708)
[19:51:50] PROBLEM - Disk
space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2510 MB (5% inode=67%)
[19:52:27] (CR) Jforrester: [C: -2] "Blocked on Security Review sign-off." [mediawiki-config] - https://gerrit.wikimedia.org/r/446841 (https://phabricator.wikimedia.org/T180981) (owner: Jforrester)
[19:52:34] (CR) Jforrester: [C: -2] "Blocked on Security Review sign-off." [mediawiki-config] - https://gerrit.wikimedia.org/r/446842 (https://phabricator.wikimedia.org/T180981) (owner: Jforrester)
[19:52:46] (CR) Jforrester: [C: -2] "Filled with FIXMEs." [mediawiki-config] - https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: Jforrester)
[19:52:51] (CR) jerkins-bot: [V: -1] Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: Jforrester)
[19:53:06] (CR) Jforrester: "Blocked on Security Review sign-off." [mediawiki-config] - https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: Jforrester)
[19:53:12] (CR) jerkins-bot: [V: -1] Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981) (owner: Jforrester)
[19:53:19] (CR) jerkins-bot: [V: -1] Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981) (owner: Jforrester)
[19:53:32] (CR) jerkins-bot: [V: -1] Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981) (owner: Jforrester)
[19:53:38] (Abandoned) Jforrester: Enable the WikibaseMediaInfo extension in Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/446845
(https://phabricator.wikimedia.org/T180981) (owner: Jforrester)
[19:53:52] (CR) Jforrester: [C: -2] "Absolutely not yet." [mediawiki-config] - https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708) (owner: Jforrester)
[19:53:55] (CR) jerkins-bot: [V: -1] Enable WikibaseMediaInfo on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708) (owner: Jforrester)
[19:54:00] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (RobH) I wouldn't think so, nope. The OS install doesn't really make use of eth1...
[19:58:47] Operations, WMF-Blog-Social-Team, Wikimedia-Mailing-lists: Request mailman list "worldcup2018" for upcoming affiliate campaign - https://phabricator.wikimedia.org/T196003 (Aklapper)
[19:58:59] Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Aklapper) >>! In T179568#4661941, @Dzahn wrote: > https://lists.wikimedia.org/mailman/listinfo/worldcup2018 (How is this even remotely Wikimedia related?) @dzahn: T196003; maybe...
[19:59:30] (CR) Jforrester: [C: -2] Install but don't enable the WikibaseMediaInfo extension, part III (8 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444) (owner: Jforrester)
[20:00:00] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get)
[20:00:21] PROBLEM - Nginx local proxy to jobrunner on mw1299 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:00:24] (PS2) Dzahn: nagios_common: add stretch support to check_ssl [puppet] - https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782)
[20:01:10] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy
[20:01:19] RECOVERY - Nginx local proxy to jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.077 second response time
[20:02:50] (PS3) Dzahn: icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782)
[20:03:11] (PS3) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/446843 (https://phabricator.wikimedia.org/T137444)
[20:03:13] (PS2) Jforrester: Install but don't enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466953 (https://phabricator.wikimedia.org/T180981)
[20:03:15] (PS2) Jforrester: Enable WikibaseMediaInfo on Beta Cluster Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466954 (https://phabricator.wikimedia.org/T180981)
[20:03:17] (PS3) Jforrester: Install but don't enable the WikibaseMediaInfo extension, part
IV [mediawiki-config] - https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981)
[20:03:19] (PS2) Jforrester: Enable WikibaseMediaInfo on Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708)
[20:03:30] (CR) jerkins-bot: [V: -1] icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:10:00] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:11:31] Operations, ops-eqiad, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (aborrero)
[20:11:39] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (aborrero) Resolved>Open This still doesn't work. Same behavior as before: packets go out of the interface but nothing en...
[20:13:42] Operations, Traffic: certcentral: check for SCTs, with optional disable per-account - https://phabricator.wikimedia.org/T206876 (BBlack) p:Triage>Normal
[20:14:39] RECOVERY - Disk space on contint1001 is OK: DISK OK
[20:14:52] ^ i gzipped a bunch of zuul logs for that
[20:15:04] we didn't want jerkins to run out of disk
[20:17:19] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 56 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:29:15] Operations, Fundraising-Backlog, SRE-Access-Requests, Wikimedia-Fundraising, fundraising-tech-ops: Frdev1001 server and mysql access - https://phabricator.wikimedia.org/T206478 (jkim_wikimedia) Hi @cwdent - I do!
[20:32:25] Operations, Wikimedia-Mailing-lists: all mailing lists should have descriptions - https://phabricator.wikimedia.org/T179568 (Dzahn) @Aklapper thanks. i didn't know the "for affiliate campaign" part of it. That makes it work related.
[20:40:15] (CR) Smalyshev: wdqs: cleanup logback configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[20:40:52] Operations, Wikimedia-General-or-Unknown, Performance: PHP fatal error only in one portal in arwiki - https://phabricator.wikimedia.org/T206878 (Reedy)
[20:41:00] Operations, Wikimedia-General-or-Unknown, Performance: PHP fatal error only in one portal in arwiki - https://phabricator.wikimedia.org/T206878 (alanajjar)
[20:44:10] RECOVERY - IPMI Sensor Status on dns5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:44:11] Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994 (aborrero) From https://blogs.gnome.org/dcbw/2018/07/27/the-ascendance-of-nftables/ : ``` What about eBPF?
You might have heard that eBPF will replace everything and give everyone a unicorn. It might, if...
[20:47:39] RECOVERY - IPMI Sensor Status on lvs5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:48:10] RECOVERY - IPMI Sensor Status on lvs5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:48:31] Operations, Wikimedia-General-or-Unknown, Performance: PHP fatal error only in one portal in arwiki - https://phabricator.wikimedia.org/T206878 (alanajjar) >Note: one user said that he can open it, but can't save anything in this page. Sorry for the bad quality of this image, but this user faced thi...
[20:49:20] RECOVERY - IPMI Sensor Status on cp5012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:49:20] RECOVERY - IPMI Sensor Status on cp5008 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:54:02] Operations, Wikidata, Wikidata-Query-Service: WDQS disk usage increase is correlated with reloading of categories - https://phabricator.wikimedia.org/T200202 (Smalyshev) Open>Resolved a:Smalyshev Does not happen anymore since we're using dailies.
[20:54:30] RECOVERY - IPMI Sensor Status on cp5005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:55:40] RECOVERY - IPMI Sensor Status on cp5004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:55:40] RECOVERY - IPMI Sensor Status on cp5001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[20:55:40] (CR) Cwhite: nagios_common: add stretch support to check_ssl (1 comment) [puppet] - https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[20:56:08] Operations, Cloud-VPS, netops, cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (Papaul) I checked the other cloudvirt node that are working (1021,1020 and 1022) all of their eth1 is part of the interface-rang...
[21:00:00] RECOVERY - IPMI Sensor Status on cp5010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:03:19] RECOVERY - IPMI Sensor Status on cp5006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:06:10] RECOVERY - IPMI Sensor Status on cp5002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:10:12] (PS1) Cwhite: phabricator: remove custom diamond::collector [puppet] - https://gerrit.wikimedia.org/r/466988 (https://phabricator.wikimedia.org/T183454)
[21:10:14] (PS1) Cwhite: phabricator: remove diamond::collector and purge diamond [puppet] - https://gerrit.wikimedia.org/r/466989 (https://phabricator.wikimedia.org/T183454)
[21:11:00] RECOVERY - IPMI Sensor Status on cp5003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[21:11:06] (PS1) Bstorm: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - https://gerrit.wikimedia.org/r/466990
[21:13:52] Operations, Wikimedia-General-or-Unknown, Performance: arwiki page giving "entire web request took longer than 60 seconds and timed out" - https://phabricator.wikimedia.org/T206878 (Reedy)
[21:17:22] (PS2) Bstorm: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - https://gerrit.wikimedia.org/r/466990
[21:29:01] Operations, monitoring, Patch-For-Review, User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (colewhite)
[21:31:43] (PS3) Bstorm: labstore: make nfsd-ldap package required for jessie, but not stretch [puppet] - https://gerrit.wikimedia.org/r/466990
[21:33:20] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:37:06] (CR) Bstorm: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/466990 (owner: Bstorm)
[21:37:52] (CR) Bstorm: "The puppet compiler seems ok with this approach on the existing jessie servers." [puppet] - https://gerrit.wikimedia.org/r/466990 (owner: Bstorm)
[21:38:08] (CR) Bstorm: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/466990 (owner: Bstorm)
[21:40:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 58 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:57:56] (CR) Faidon Liambotis: [C: -1] cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (3 comments) [dns] - https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: Arturo Borrero Gonzalez)
[22:00:53] (CR) Bstorm: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) (owner: Bstorm)
[22:01:07] (PS24) Bstorm: WIP toolforge: write/move a sonofgridengine module and toolforge profile [puppet] - https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557)
[22:06:11] (CR) Gehel: wdqs: cleanup logback configuration (3 comments) [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: Gehel)
[22:07:00] (PS8) Gehel: wdqs: cleanup logback configuration [puppet] - https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563)
[22:09:00] PROBLEM - Memory correctable errors -EDAC- on db1069 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[22:10:47] (CR) Bstorm: [C: 2] WIP toolforge: write/move a sonofgridengine module and toolforge profile [puppet] - https://gerrit.wikimedia.org/r/448791
(https://phabricator.wikimedia.org/T200557) (owner: 10Bstorm) [22:13:36] (03CR) 10Smalyshev: wdqs: cleanup logback configuration (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [22:14:42] (03CR) 10Faidon Liambotis: [C: 04-1] "I have a proposal for an easier solution: libmonitoring-plugin-perl is available in jessie-backports, so just ensure => installed that on " [puppet] - 10https://gerrit.wikimedia.org/r/466951 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [22:21:07] 10Operations, 10Cloud-VPS (Ubuntu Trusty Deprecation): cloudvps: toolserver-legacy project trusty deprecation - https://phabricator.wikimedia.org/T204564 (10bd808) [22:21:26] 10Puppet, 10Cloud-VPS (Ubuntu Trusty Deprecation): cloudvps: puppet project trusty deprecation - https://phabricator.wikimedia.org/T204558 (10bd808) [22:36:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [22:42:38] (03PS1) 10Smalyshev: My tests show that Kafka poller behaves much better with -b 700 [puppet] - 10https://gerrit.wikimedia.org/r/467002 [22:43:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [23:33:03] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10faidon) [23:33:12] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): cannot add cloudvirt1023 eth1 to cloud-instances2-b-eqiad vlan - https://phabricator.wikimedia.org/T206872 (10faidon) 05Open>03Resolved >>! 
In T206872#4662465, @Papaul wrote: > I checked the other cloudvirt node that are working (1021... [23:34:29] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [23:41:40] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 53 probes of 318 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts