[00:05:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10bd808) [00:05:22] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS: Fewer transitory middle-of-the-night puppet alerts - https://phabricator.wikimedia.org/T206224 (10bd808) [00:05:40] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10bd808) [00:07:53] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [00:09:34] RECOVERY - puppet last run on kafka-jumbo1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [00:11:44] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:16:50] 10Puppet, 10Cloud-Services, 10Documentation: Missing documentation for labs puppet roles - https://phabricator.wikimedia.org/T91770 (10bd808) 05Open>03declined Puppet functionality has been removed from OpenStackManager. The new location for this is Horizon. The configuration screens in Horizon do not at... [00:31:34] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:42:13] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [00:47:04] PROBLEM - puppet last run on mw1301 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:51:04] 10Operations, 10fundraising-tech-ops, 10netops: Qualys scans causing problematic pfw logspam - https://phabricator.wikimedia.org/T206431 (10cwdent) [01:01:54] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:05:53] (03PS2) 10BryanDavis: jdk8: Switch base image to Stretch [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/463877 (https://phabricator.wikimedia.org/T205774) [01:15:54] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:17:33] RECOVERY - puppet last run on mw1301 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:20:23] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:21:33] PROBLEM - puppet last run on mw1338 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:39:33] PROBLEM - puppet last run on cp1077 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:41:04] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [01:46:53] RECOVERY - puppet last run on mw1338 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:50:33] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:59:14] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:04:44] RECOVERY - puppet last run on cp1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:08:43] PROBLEM - puppet last run on analytics-tool1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:17:34] PROBLEM - puppet last run on prometheus2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:24:24] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:24:43] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:37:24] (03PS3) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) [02:37:45] (03CR) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [02:39:04] RECOVERY - puppet last run on analytics-tool1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [02:43:03] RECOVERY - puppet last run on prometheus2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:44:54] PROBLEM - puppet last run on cloudvirt1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:49:54] RECOVERY - puppet last run on cloudvirt1020 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:50:57] (03CR) 10Zhuyifei1999: [C: 031] jdk8: Switch base image to Stretch [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/463877 (https://phabricator.wikimedia.org/T205774) (owner: 10BryanDavis) [02:51:23] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [02:54:44] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:54:47] 10Operations, 10Traffic, 10fundraising-tech-ops, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Liuxinyu970226) @krenair please, no more DV certs, that's the reason why jawiki, ugwiki, wuuwiki, zhwiki, zh-yuewiki and zhwikinews are SNI RSTed by... [03:00:12] 10Operations, 10Traffic, 10HTTPS, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Liuxinyu970226) [03:00:25] 10Operations, 10Traffic, 10Patch-For-Review: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Liuxinyu970226) [03:03:19] (03PS5) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [03:08:14] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:08:55] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [03:10:24] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [03:15:44] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:21:43] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [03:27:43] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:29:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 904.81 seconds [03:31:52] Is it OK to deploy cxserver update today() (in few minutes)? [03:41:04] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [03:46:13] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:52:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.45 seconds [03:54:43] PROBLEM - puppet last run on logstash1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:55:03] PROBLEM - puppet last run on cp1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:58:13] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [03:59:13] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:11:33] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:24:53] RECOVERY - puppet last run on logstash1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:25:14] RECOVERY - puppet last run on cp1085 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:29:24] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:16:20] 10Operations, 10ops-eqiad, 10DBA: db1064 has disk smart error - https://phabricator.wikimedia.org/T206245 (10Marostegui) 05Open>03Resolved a:03Marostegui Thanks! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Primary-1, Secondary-0, RAID Level... [05:18:38] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) This disk has been marked with errors, can we get a different one? ``` Span: 5 - Number of PDs: 2 PD: 0 Information Enclosure Device ID: 32 Slot Number: 10 Drive's position: DiskGroup: 0, Spa... [05:20:31] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) The disk failed to rebuild, if this is a brand new disk, can you pull it out wait a couple of minutes and then pull it in back? Thanks! ``` Enclosure Device ID: 32 Slot Number: 3 Drive's posi... [05:21:13] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1072 is CRITICAL: cluster=mysql device=megaraid,10 instance=db1072:9100 job=node site=eqiad Marostegui T206313 - The acknowledgement expires at: 2018-10-11 05:20:56. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1072&var-datasource=eqiad%2520prometheus%252Fops [05:41:44] (03PS1) 10Marostegui: db-eqiad.php: Repool db1092 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465120 (https://phabricator.wikimedia.org/T205514) [06:04:58] marostegui: Is it OK to update cxserver in Production today? [06:06:27] kart_: What does that imply? There is a train freeze this week as in two days we have the DC failover back to eqiad. So I would say, if you can wait a couple of days (failover is the 10th), maybe worth waiting [06:06:51] But I have not much idea on what does that update imply, maybe another SRE with more experience with that can help you with the decision [06:07:27] OK. It is cxserver service update, independent from train. [06:07:56] kart_: I know, my point is that if we have blocked the train, maybe this should also not happen. [06:08:10] OK. Let's wait then! :) [06:08:13] kart_: Again, maybe ping some other SRE with more experience with this service to evaluate the yes/no decision :-) [06:08:21] Sure. [06:09:25] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Repackaging for stretch [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/464507 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [06:11:17] <_joe_> kart_: I don't see anything planned on the deployment calendar, and frankly I think anything that's not an emergency or a small configuration should wait for next week [06:12:34] <_joe_> it's not like cxserver is not part of our production environment [06:27:19] _joe_: ack. will wait. [06:27:49] _joe_: generally, I don't add it to calendar for service update as it is as required. but it seems good if we do that.
[06:28:46] <_joe_> kart_: I was mostly saying if you want to release a new version of a software during a freeze, that needs to be discussed with release engineering and us, and must have some urgency [06:29:05] <_joe_> if those things were done, the deployment request would be in the deployment calendar [06:29:13] * _joe_ bbl [06:29:21] _joe_: noted. nothing breaking as of now. [07:00:53] !log installing git security updates [07:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:47] (03CR) 10Banyek: [C: 031] "It's a replica, so it just have to deal with writes directly from replication, I'd say lgtm. In worst case scenario we can revert this is " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465120 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [07:16:59] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1092 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465120 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [07:17:45] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) [07:18:42] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1092 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465120 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [07:20:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1092 with low weight - T205514 (duration: 01m 27s) [07:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:36] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [07:24:43] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 573.1 ge 4 Muehlenhoff T194174 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw%2520prometheus%252Fops 
[07:25:59] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for PowerDNS Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/464778 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:26:13] (03CR) 10Banyek: [C: 032] wikireplicas: enable wmf-pt-kill service [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [07:26:27] (03PS3) 10Banyek: wikireplicas: enable wmf-pt-kill service [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) [07:26:33] (03CR) 10Banyek: [V: 032 C: 032] wikireplicas: enable wmf-pt-kill service [puppet] - 10https://gerrit.wikimedia.org/r/464821 (https://phabricator.wikimedia.org/T203674) (owner: 10Banyek) [07:27:01] !log enabling first time wmf-pt-kill on labsdb1010 [07:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:23] banyek: shall I merge your patch along? [07:27:35] yes please :) [07:28:15] done :-) [07:28:52] tx [07:30:43] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1092 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465120 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui) [07:35:54] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:36:45] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work), 10Services (done): Add sudo rules for wdqs-updater in puppet - https://phabricator.wikimedia.org/T206303 (10mobrovac) 05Open>03Resolved [07:38:20] !log reducing relative weight of wdqs2003 in pybal - T206423 [07:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:24] !log gehel@puppetmaster1001 conftool action : set/weight=15; selector: dc=codfw,cluster=wdqs,name=wdqs2001.eqiad.wmnet [07:38:24] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 [07:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:29] !log gehel@puppetmaster1001 conftool action : set/weight=15; selector: dc=codfw,cluster=wdqs,name=wdqs2002.eqiad.wmnet [07:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:38] <_joe_> !log disabling puppet, doing etcd tests on lvs1006 [07:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:43] !log Disable GTID on s1,s2,s3,s4,s6,s7,s8 eqiad masters in preparation for enabling replication eqiad -> codfw [07:40:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:27] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) [07:43:59] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for prometheus-rabbitmq-exporter [puppet] - 10https://gerrit.wikimedia.org/r/464769 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:46:02] (03CR) 10Mobrovac: [C: 04-1] scap::target: added services_names param (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464807 (https://phabricator.wikimedia.org/T206314) (owner: 
10Mathew.onipe) [07:46:54] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/465014 (owner: 10Volans) [07:48:34] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=42) [07:48:52] 10Operations, 10Release-Engineering-Team, 10Scap, 10Discovery-Search (Current work), and 2 others: Modify scap::target to define sudo rules for multiple services - https://phabricator.wikimedia.org/T206314 (10mobrovac) [07:50:37] !log Enabling replication eqiad -> codfw in preparation for DC failover [07:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:24] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:59:27] (03PS1) 10Joal: Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 [08:00:14] (03CR) 10jerkins-bot: [V: 04-1] Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 (owner: 10Joal) [08:04:14] (03CR) 10Muehlenhoff: [C: 031] "Looks good. 
Once puppet has run on all hosts which have the collector, best to send a followup patch which removes the collector from pupp" [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:07:04] (03PS2) 10Joal: Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 [08:07:39] (03CR) 10jerkins-bot: [V: 04-1] Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 (owner: 10Joal) [08:09:29] (03CR) 10Muehlenhoff: [C: 031] cumin: fix alias query [puppet] - 10https://gerrit.wikimedia.org/r/465012 (owner: 10Volans) [08:09:55] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:10:40] (03PS3) 10Volans: cumin: fix alias query [puppet] - 10https://gerrit.wikimedia.org/r/465012 [08:12:20] (03CR) 10Volans: [C: 032] cumin: fix alias query (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465012 (owner: 10Volans) [08:12:55] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:12:59] (03PS2) 10Volans: cumin: alias checker, catch exception [puppet] - 10https://gerrit.wikimedia.org/r/465014 [08:14:48] (03CR) 10Volans: [C: 032] cumin: alias checker, catch exception [puppet] - 10https://gerrit.wikimedia.org/r/465014 (owner: 10Volans) [08:16:47] (03PS3) 10Joal: Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 [08:19:30] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 (10Gehel) Looking at [[ https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&from=153894... 
[08:19:44] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) [08:20:16] !log gehel@puppetmaster1001 conftool action : set/weight=15; selector: dc=codfw,cluster=wdqs,name=wdqs2002.codfw.wmnet [08:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:21] !log gehel@puppetmaster1001 conftool action : set/weight=15; selector: dc=codfw,cluster=wdqs,name=wdqs2001.codfw.wmnet [08:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:36] !log Disable gtid on es2 and es3 eqiad master [08:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:43] (03PS2) 10Vgutierrez: install_server: Add certcentral[12]001 to the DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/464806 (https://phabricator.wikimedia.org/T206308) [08:24:09] (03PS4) 10Elukey: Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 (owner: 10Joal) [08:25:44] (03CR) 10Vgutierrez: [C: 032] install_server: Add certcentral[12]001 to the DHCP configuration [puppet] - 10https://gerrit.wikimedia.org/r/464806 (https://phabricator.wikimedia.org/T206308) (owner: 10Vgutierrez) [08:25:46] (03CR) 10Elukey: "two nits and then we are ready to go!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465125 (owner: 10Joal) [08:27:08] (03PS1) 10Filippo Giunchedi: swift: fix user/group for log files/dirs [puppet] - 10https://gerrit.wikimedia.org/r/465127 (https://phabricator.wikimedia.org/T205974) [08:28:01] (03CR) 10Muehlenhoff: [C: 04-1] "See my comment on 464917, this needs to fixed first. 
Also, did you check that all of these roles use no other collectors from the base Dia" [puppet] - 10https://gerrit.wikimedia.org/r/464918 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:30:54] (03PS5) 10Joal: Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 [08:30:57] (03CR) 10Joal: Add http_proxy to profile::analytics::refinery::job::project_namespace_map (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465125 (owner: 10Joal) [08:33:37] (03CR) 10Elukey: [C: 032] Add http_proxy to profile::analytics::refinery::job::project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/465125 (owner: 10Joal) [08:41:34] (03PS1) 10Vgutierrez: install_server: provide netboot config for certcentral[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/465129 (https://phabricator.wikimedia.org/T206308) [08:42:21] (03CR) 10Muehlenhoff: [C: 04-1] ntp: move diamond::collector to where it will only apply to ntp servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464866 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:42:53] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) [08:46:20] (03PS1) 10Elukey: druid: replace analytics1003 with an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/465130 (https://phabricator.wikimedia.org/T203635) [08:48:22] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for ircecho [puppet] - 10https://gerrit.wikimedia.org/r/464791 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:48:24] PROBLEM - confd service on rdb1003 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [08:49:33] RECOVERY - confd service on rdb1003 is OK: OK - confd is active [08:56:08] (03CR) 10Filippo Giunchedi: "Note that there are few cases where memcached is present (the class) 
but not memcached_exporter, one way to audit these is:" [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:57:11] (03PS2) 10Vgutierrez: install_server: provide netboot config for certcentral[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/465129 (https://phabricator.wikimedia.org/T206308) [08:59:15] (03CR) 10Vgutierrez: [C: 032] install_server: provide netboot config for certcentral[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/465129 (https://phabricator.wikimedia.org/T206308) (owner: 10Vgutierrez) [09:05:36] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10akosiaris) 05Open>03Resolved And now we got ``` sudo /usr/local/lib/nagios/plugins/check_raid OK: optimal, 1 logical, 12 physical OK ``` Great. Thanks @Cmjohnson [09:06:48] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [09:10:46] (03CR) 10Muehlenhoff: "Looks good (once 464866 is fixed), but personally I'd rather disable this via the role instead via specific hosts entries (i.e. 
hieradata/" [puppet] - 10https://gerrit.wikimedia.org/r/464867 (owner: 10Cwhite) [09:14:07] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) [09:16:52] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for varnish-hospital [puppet] - 10https://gerrit.wikimedia.org/r/464502 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:22:37] jouncebot, next [09:22:38] In 1 hour(s) and 37 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181008T1100) [09:23:19] (03PS2) 10Elukey: druid: replace analytics1003 with an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/465130 (https://phabricator.wikimedia.org/T203635) [09:24:31] (03PS4) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [09:26:40] (03CR) 10Elukey: [C: 032] druid: replace analytics1003 with an-coord1001 [puppet] - 10https://gerrit.wikimedia.org/r/465130 (https://phabricator.wikimedia.org/T203635) (owner: 10Elukey) [09:30:05] (03PS5) 10Muehlenhoff: Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) [09:30:55] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for varnish-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/464516 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:30:57] 10Operations, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10Joe) [09:32:13] 10Operations, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10Joe) [09:32:16] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - 
https://phabricator.wikimedia.org/T196685 (10Joe) [09:33:47] (03PS7) 10Ema: Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [09:34:58] (03CR) 10Ema: [C: 032] Backend-Timing Varnish mtail program [puppet] - 10https://gerrit.wikimedia.org/r/434879 (https://phabricator.wikimedia.org/T131894) (owner: 10Gilles) [09:36:31] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [09:38:48] 10Operations, 10Wikimedia-Site-requests, 10Commons: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751 (10Urbanecm) I urge deployers and others with proper access to fulfil this. It is waiting since May. @Dereckson, @Reedy, anybody to process? [09:44:24] (03PS1) 10Ema: cache: enable varnish::logging::backendtiming [puppet] - 10https://gerrit.wikimedia.org/r/465136 (https://phabricator.wikimedia.org/T131894) [09:45:06] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), and 2 others: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105 (10Mvolz) I'm getting reports of some DOIs still bei... [09:51:34] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [09:51:47] !log rebuild sdc sdh sdj sdi on ms-be2041 with crc=1 finobt=0 - T199198 [09:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:51] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [09:55:35] !log installing python3.5/python2.7 security updates [09:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:26] (03PS1) 10Muehlenhoff: Remove Diamond from Hadoop systems [puppet] - 10https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454) [10:04:30] \o/ [10:07:00] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10fgiunchedi) p:05Triage>03Normal [10:07:27] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [10:07:29] 10Operations, 10Wikimedia-Logstash: Investigate Kafka main cluster usage for logging pipeline - https://phabricator.wikimedia.org/T205873 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Looks like we have a way forward! Resolving in favor of {T206454} to track the actual Kafka setup work. 
[10:22:43] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:23:15] (03PS1) 10Giuseppe Lavagetto: lvs: switch eqiad/esams to the new eqiad etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/465138 (https://phabricator.wikimedia.org/T205814) [10:23:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:24:08] (03PS1) 10Vgutierrez: site: add certcentral[12]001 as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/465139 (https://phabricator.wikimedia.org/T206308) [10:24:13] (03CR) 10Elukey: [C: 031] lvs: switch eqiad/esams to the new eqiad etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/465138 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [10:26:20] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465140 (https://phabricator.wikimedia.org/T128546) [10:28:23] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: switch eqiad/esams to the new eqiad etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/465138 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [10:31:34] (03Abandoned) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465140 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:32:34] PROBLEM - Restbase root url on restbase2001 is CRITICAL: HTTP CRITICAL - No data received from host [10:32:51] <_joe_> elukey: damn this will make the lvs servers alarm badly [10:33:07] <_joe_> I have to fix the whole puppettization of the thing [10:33:44] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 16081 bytes in 0.124 second 
response time [10:35:33] PROBLEM - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001:2379 (min=4) [10:35:43] <_joe_> this is expected ^^ [10:36:24] ack [10:37:13] PROBLEM - PyBal connections to etcd on lvs3004 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001:2379 (min=8) [10:37:23] :( [10:41:05] <_joe_> elukey: I'm fixing this, but I'm trying to do a proper fix and it will take time [10:41:45] !log restart mcrouter on mw2201 with more verbose logging settings as test [10:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:12] _joe_ I can prepare the srv records dns change in the meantime [10:42:21] <_joe_> elukey: thanks [10:52:41] (03PS2) 10Vgutierrez: site: add certcentral[12]001 as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/465139 (https://phabricator.wikimedia.org/T206308) [10:53:36] (03CR) 10Vgutierrez: [C: 032] site: add certcentral[12]001 as spare systems [puppet] - 10https://gerrit.wikimedia.org/r/465139 (https://phabricator.wikimedia.org/T206308) (owner: 10Vgutierrez) [10:56:45] (03PS1) 10Giuseppe Lavagetto: pybal: support etcd connections to ports different from 2379 [puppet] - 10https://gerrit.wikimedia.org/r/465142 (https://phabricator.wikimedia.org/T205814) [10:57:33] oh.. on the puppet side, that makes sense [10:58:12] <_joe_> vgutierrez: yes, but we will need to fix the other side too (the nagios check) if we move to using SRV records, like every other app we have [10:58:15] <_joe_> :P [10:59:41] yup [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181008T1100). 
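The `conf1004.eqiad.wmnet:4001:2379` string in the alerts above looks like a default port appended to an endpoint that already carried its own port. A minimal Python sketch of endpoint parsing that avoids this doubling follows. `split_endpoint` is a hypothetical helper written for illustration; it is not PyBal's code or the actual Icinga check, and it assumes plain `host` or `host:port` strings (no IPv6 literals).

```python
DEFAULT_ETCD_PORT = 2379  # the default port named in the patch title above


def split_endpoint(spec, default_port=DEFAULT_ETCD_PORT):
    """Return (host, port), using default_port only when spec has no port.

    Hypothetical helper for illustration; assumes 'host' or 'host:port'.
    """
    host, sep, port = spec.rpartition(":")
    if sep and port.isdigit():
        # The spec already carries an explicit port: keep it as-is.
        return host, int(port)
    # No explicit port: fall back to the default.
    return spec, default_port


def format_endpoint(spec):
    """Render an endpoint as 'host:port' without appending a second port."""
    host, port = split_endpoint(spec)
    return "{}:{}".format(host, port)
```

With this shape, an endpoint that already names port 4001 is rendered once, and a bare hostname picks up 2379.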
[11:00:04] rxy, Urbanecm, and _joe_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] o/ [11:00:16] <_joe_> uh? [11:00:22] xD [11:00:29] <_joe_> I read 13:00 UTC [11:00:41] <_joe_> so I expected it to be in 2 hours [11:00:58] <_joe_> it was 13:00 UTC+2 [11:01:08] 11:00–12:00 UTC # [11:01:16] <_joe_> yeah brainfart on my side [11:04:09] * addshore can not swat [11:04:59] <_joe_> any deployers available? [11:05:15] (03PS2) 10Giuseppe Lavagetto: pybal: support etcd connections to ports different from 2379 [puppet] - 10https://gerrit.wikimedia.org/r/465142 (https://phabricator.wikimedia.org/T205814) [11:05:51] <_joe_> I'm ok with postponing it [11:06:55] (03PS1) 10Elukey: Move _etcd* SRV records to conf100[4-6] where needed [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) [11:07:19] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1001 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001:2379 (min=4) Giuseppe Lavagetto transition to a new etcd cluster and some issues with puppettization. [11:07:19] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=42) Giuseppe Lavagetto transition to a new etcd cluster and some issues with puppettization. [11:07:19] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs3004 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001:2379 (min=8) Giuseppe Lavagetto transition to a new etcd cluster and some issues with puppettization. [11:08:02] (03CR) 10Elukey: [C: 04-1] "Self -1, only RO for now." 
[dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [11:09:16] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12803/seems it DTRT" [puppet] - 10https://gerrit.wikimedia.org/r/465142 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [11:11:08] no deployers around? [11:13:30] 10Operations, 10Traffic, 10vm-requests, 10Patch-For-Review: Create VMs for certcentral hosts - https://phabricator.wikimedia.org/T206308 (10Vgutierrez) 05Open>03Resolved VMs delivered, added in puppet as spare systems till certcentral puppetization is ready to go [11:13:33] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Vgutierrez) [11:14:48] present [11:14:50] zeljkof, around? [11:14:52] There's swat ongoing :D [11:15:37] (03PS2) 10Elukey: Move _etcd* SRV RO records to conf100[4-6] where needed [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) [11:15:43] Urbanecm: sorry, it's a holiday here, I'm not working [11:16:20] zeljkof, enjoy your holiday, but we'd appreciate somebody with deploy privileges :) [11:17:37] right, I can deploy [11:17:41] Urbanecm: rxy [11:17:47] o/ [11:17:47] Good! [11:17:51] give me a few moments [11:19:51] heh, i havn't used unset with multiple params before [11:20:04] what is unset? [11:20:12] see https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/464890/ [11:20:15] <_joe_> addshore: I can wait though [11:20:16] rxy: your first [11:20:20] _joe_: okay! 
[11:20:22] (03PS2) 10Addshore: Remove the "reviewer" group at ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464890 (https://phabricator.wikimedia.org/T205997) (owner: 10Rxy) [11:20:23] (03CR) 10Addshore: [C: 032] Remove the "reviewer" group at ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464890 (https://phabricator.wikimedia.org/T205997) (owner: 10Rxy) [11:21:16] wow, fatal monitor is just full of requests taking too long [11:21:25] <_joe_> addshore: oh? [11:21:32] thats the new spammy log spam [11:21:55] (03Merged) 10jenkins-bot: Remove the "reviewer" group at ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464890 (https://phabricator.wikimedia.org/T205997) (owner: 10Rxy) [11:22:00] normally its a mix of stuff, but it's all about requests taking to long now [11:22:03] <_joe_> addshore: 337 in the last 2 hours [11:22:08] (03CR) 10jenkins-bot: Remove the "reviewer" group at ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464890 (https://phabricator.wikimedia.org/T205997) (owner: 10Rxy) [11:22:35] <_joe_> oh no it's more [11:22:47] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) [11:22:49] <_joe_> yes, not great [11:23:03] (03PS6) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [11:24:13] looks like its been a pretty constant rate all month, just apparently i havn't done a swat in a while [11:24:42] rxy: will you be able to test your patch on mwdebug2002? [11:24:53] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) [11:26:20] addshore: ok, It work correctly. Please deploy ! 
[11:26:28] will do [11:27:02] syncing [11:27:20] (03CR) 10Addshore: [C: 032] Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769) (owner: 10MarcoAurelio) [11:27:25] (03PS5) 10Addshore: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769) (owner: 10MarcoAurelio) [11:27:30] (03CR) 10Addshore: [C: 032] Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769) (owner: 10MarcoAurelio) [11:27:55] !log addshore@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: [[gerrit:464890]] Remove the "reviewer" group at ruwikisource T205997 (duration: 00m 57s) [11:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:58] T205997: Remove the "reviewer" user group from ru.wikisource - https://phabricator.wikimedia.org/T205997 [11:27:58] rxy: done [11:28:04] Urbanecm: you are next [11:28:25] thanks. It work correctly at server: mw2233.codfw.wmnet :) [11:28:31] rxy: lovely! [11:29:05] (03Merged) 10jenkins-bot: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769) (owner: 10MarcoAurelio) [11:29:23] Urbanecm: is this testable on mwdebug? 
[11:29:52] ack [11:29:58] its on mwdebug2002 now [11:30:05] should be [11:31:07] (03PS2) 10Addshore: Add throttle exception for Netherlands Hackathon October 2018 - Wiki Techstorm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465047 (https://phabricator.wikimedia.org/T206241) (owner: 10Urbanecm) [11:31:11] (03CR) 10Addshore: [C: 032] Add throttle exception for Netherlands Hackathon October 2018 - Wiki Techstorm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465047 (https://phabricator.wikimedia.org/T206241) (owner: 10Urbanecm) [11:31:38] addshore, works. Please push it to the production and run namespaceDupes.php afterwards [11:31:49] syncing [11:32:40] (03Merged) 10jenkins-bot: Add throttle exception for Netherlands Hackathon October 2018 - Wiki Techstorm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465047 (https://phabricator.wikimedia.org/T206241) (owner: 10Urbanecm) [11:32:43] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:455249]] Use translated MetaNamespace for fy.wiktionary T202769 (duration: 00m 58s) [11:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:46] T202769: Request to rename a namespace on fy.wiktionary.org - https://phabricator.wikimedia.org/T202769 [11:32:58] Urbanecm: with --fix? is there a dry run? [11:33:16] you can invoke dry run by not passing --fix [11:33:28] mwscript namespaceDupes.php --wiki=wiki is dry run [11:33:32] mwscript namespaceDupes.php --wiki=wiki --fix is real run [11:33:49] !log addshore@mwmaint2001:~$ mwscript namespaceDupes.php --wiki fywiktionary # (dryrun, 11529 links to fix, 11529 were resolvable.) 
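The dry-run versus real-run distinction described above (omit `--fix` for a dry run, pass it for a real run) can be sketched as a small Python command builder. `namespace_dupes_cmd` is a hypothetical helper for illustration only; the flags `--wiki`, `--fix`, and `--add-prefix` are the ones that appear in this log, and nothing here actually runs `mwscript`.

```python
def namespace_dupes_cmd(wiki, fix=False, add_prefix=None):
    """Build the argument list for a namespaceDupes.php invocation.

    Hypothetical helper: without fix=True the result is a dry run,
    mirroring the usage described in the channel.
    """
    cmd = ["mwscript", "namespaceDupes.php", "--wiki", wiki]
    if fix:
        # Only a run with --fix modifies anything.
        cmd.append("--fix")
    if add_prefix is not None:
        # Prefix option as used later in the log for unresolved titles.
        cmd.append("--add-prefix={}".format(add_prefix))
    return cmd


if __name__ == "__main__":
    # Print the dry-run command first, then the real one.
    print(" ".join(namespace_dupes_cmd("fywiktionary")))
    print(" ".join(namespace_dupes_cmd("fywiktionary", fix=True)))
```

Building the dry-run command first and reading its summary before adding `--fix` matches the order followed in the channel.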
[11:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:25] !log addshore@mwmaint2001:~$ mwscript namespaceDupes.php --wiki fywiktionary --fix # Started [11:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:15] !log addshore@mwmaint2001:~$ mwscript namespaceDupes.php --wiki fywiktionary --fix # Finished, still 111 pages to fix [11:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:16] ^^ Urbanecm [11:35:32] can you post the output with 111 pages to fix to the task please addshore ? [11:35:33] apparently [11:35:35] yup [11:36:32] Urbanecm: done [11:37:25] addshore, thank you. Can you please run the script once again with --add-prefix=T202769 [11:38:29] Urbanecm: can do [11:38:51] thank you [11:39:02] !log addshore@mwmaint2001:~$ mwscript namespaceDupes.php --wiki fywiktionary --fix --add-prefix=T202769 # T202769 [11:39:04] done [11:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:06] T202769: Request to rename a namespace on fy.wiktionary.org - https://phabricator.wikimedia.org/T202769 [11:39:33] Urbanecm: in https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/465047/2/wmf-config/throttle.php you're removing a throttle that still has some time left? (a couple of days?) [11:39:51] <_joe_> elukey: I'm going to merge my patch [11:39:54] I looked at the ticket and see that maybe it wasn't actually needed in the first place? [11:40:22] uh eh, will fix [11:40:29] addshore, what "wasn't needed"? [11:40:50] well, you removed one that ends on 2018-10-10T09:00 +2:00 [11:41:04] _joe_ ok [11:41:09] but looking at the phab ticket it looks like the idea of the throttle rule was misunderstood anyway?
[11:41:17] if you want to add it back please do (in another patch) :) [11:41:57] the ticket for the one that you removed is https://phabricator.wikimedia.org/T203909 [11:42:02] well, it isn't necessary to add, I thought you didn't merge it [11:42:53] (03CR) 10Giuseppe Lavagetto: [C: 032] pybal: support etcd connections to ports different from 2379 [puppet] - 10https://gerrit.wikimedia.org/r/465142 (https://phabricator.wikimedia.org/T205814) (owner: 10Giuseppe Lavagetto) [11:43:05] Urbanecm: yup, it is already merged [11:43:17] then I think we can keep it remove [11:43:18] d [11:43:18] Urbanecm: not deployed yet though, I missed it somehow on my first look at the patch [11:43:24] Urbanecm: ack [11:44:12] syncing [11:44:40] (03CR) 10jenkins-bot: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769) (owner: 10MarcoAurelio) [11:44:42] (03CR) 10jenkins-bot: Add throttle exception for Netherlands Hackathon October 2018 - Wiki Techstorm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465047 (https://phabricator.wikimedia.org/T206241) (owner: 10Urbanecm) [11:45:04] !log addshore@deploy1001 Synchronized wmf-config/throttle.php: Add throttle exception for Netherlands Hackathon October 2018 - Wiki Techstorm T206241, and remove other rules. (duration: 00m 56s) [11:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:06] T206241: Add throttle exception for Netherlands Hackathon October 2018 - Wiki Techstorm - https://phabricator.wikimedia.org/T206241 [11:45:18] all done [11:45:28] _joe_: were you going to handle yours? 
[11:45:50] <_joe_> addshore: it can be postponed, I'm in the middle of something else, too [11:45:57] <_joe_> so it can wait for tomorrow or later [11:46:06] <_joe_> I will remove it from the deployment calendar [11:46:10] RECOVERY - PyBal connections to etcd on lvs1001 is OK: OK: 4 connections established with conf1004.eqiad.wmnet:4001 (min=4) [11:46:18] _joe_: ack! [11:46:20] !log SWAT done [11:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:25] thanks for playing everyone! [11:46:32] <_joe_> !log restart pybal on lvs1001 [11:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:59] <_joe_> !log restart pybal in esams, after running puppet, to switch etcd cluster used [12:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:04:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:08:00] RECOVERY - PyBal connections to etcd on lvs3004 is OK: OK: 8 connections established with conf1006.eqiad.wmnet:4001 (min=8) [12:08:30] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:13:45] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Dereckson) >>! In T191183#4647075, @Krinkle wrote: > Gerrit wants 100x100px square thumbnails. 
The 100x100 size isn't what currently happens in the repository: ``` $ identify *.... [12:15:18] (03CR) 10Gehel: [C: 04-1] "See comments inline." (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [12:17:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:17:36] (03CR) 10Gehel: [C: 04-1] "Looks good, minor comment inline." (032 comments) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [12:23:08] (03PS1) 10Muehlenhoff: Make auto restart for prometheus-rabbitmq-exporter conditional on Debian [puppet] - 10https://gerrit.wikimedia.org/r/465150 [12:25:56] (03CR) 10Muehlenhoff: [C: 032] Make auto restart for prometheus-rabbitmq-exporter conditional on Debian [puppet] - 10https://gerrit.wikimedia.org/r/465150 (owner: 10Muehlenhoff) [12:27:51] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:29:07] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) p:05High>03Normal Setting this to normal now as... 
[12:29:35] (03PS1) 10Muehlenhoff: Make auto restart for prometheus-pdns-exporter conditional on Debian [puppet] - 10https://gerrit.wikimedia.org/r/465156 [12:31:50] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 42 connections established with conf1004.eqiad.wmnet:4001 (min=42) [12:32:02] (03PS2) 10Muehlenhoff: Make auto restart for prometheus-pdns-exporter conditional on Debian [puppet] - 10https://gerrit.wikimedia.org/r/465156 [12:32:21] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:36:18] (03CR) 10Muehlenhoff: [C: 032] Make auto restart for prometheus-pdns-exporter conditional on Debian [puppet] - 10https://gerrit.wikimedia.org/r/465156 (owner: 10Muehlenhoff) [12:39:39] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=42) [12:40:49] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:43:04] (03PS1) 10Gehel: elasticsearch: mjolnir daemons can be autorestarted [puppet] - 10https://gerrit.wikimedia.org/r/465159 [12:44:40] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 42 connections established with conf1004.eqiad.wmnet:4001 (min=42) [12:44:55] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Move _etcd* SRV RO records to conf100[4-6] where needed (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [12:46:57] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/405738 (https://phabricator.wikimedia.org/T184386) (owner: 10Kaganer) [12:49:41] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:49:48] !log pt-kill-wmf enabled on the wikireplicas (T203674) [12:49:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:51] T203674: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 [12:51:19] (03PS3) 10Elukey: Move _etcd* SRV RO records to conf100[4-6] where needed [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) [12:54:24] 10Operations, 10ops-eqiad, 10Analytics: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (10Cmjohnson) This is one of the issues we have with leasing.....Dell has it so Farnam is the owner not us. I think it's sorted now and attempting to get it resolved. [12:55:59] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) 05Open>03Resolved [12:56:23] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ema) cp1081 and cp1079, both on asw2-b-eqiad, are also having IPv6 connectivity issues with lvs1001: ``` 12:53:09 ema@lvs1001.wikimedia.org:~ $ curl http://localhost:9090/pools/tex... [12:56:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "sorry, one more interation (my bad)" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [13:01:29] (03PS4) 10Elukey: Move _etcd* SRV RO records to conf100[4-6] where needed [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) [13:02:19] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:04:14] (03CR) 10Giuseppe Lavagetto: [C: 031] Move _etcd* SRV RO records to conf100[4-6] where needed [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [13:04:36] (03PS1) 10Ema: Revert "Revert "traffic: Depool eqiad from user traffic for switchover"" [dns] - 10https://gerrit.wikimedia.org/r/465161 [13:04:36] !log downtime notifications for dbstore1002 replication threads (T205544) [13:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:40] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [13:05:05] (03PS1) 10BBlack: Revert "Revert "traffic: Depool eqiad from user traffic for switchover"" [dns] - 10https://gerrit.wikimedia.org/r/465162 [13:05:10] lol [13:05:16] <_joe_> ahaha [13:05:18] (03CR) 10Elukey: "There should also be role::analytics_cluster::coordinator to complete the picture :)" [puppet] - 10https://gerrit.wikimedia.org/r/465137 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [13:05:20] you win the race, I had to fetch my yubikey :P [13:05:26] <_joe_> I vote for the ema version [13:05:36] <_joe_> it looks more correct [13:05:38] !log converting cebwiki.templatelinks to TokuDB on host dbstore1002.eqiad.wmnet (T205544) [13:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:53] haha yes!
[13:06:05] (03Abandoned) 10BBlack: Revert "Revert "traffic: Depool eqiad from user traffic for switchover"" [dns] - 10https://gerrit.wikimedia.org/r/465162 (owner: 10BBlack) [13:06:42] (03PS2) 10Ema: Revert "Revert "traffic: Depool eqiad from user traffic for switchover"" [dns] - 10https://gerrit.wikimedia.org/r/465161 (https://phabricator.wikimedia.org/T201039) [13:07:28] (03CR) 10Elukey: [C: 032] Move _etcd* SRV RO records to conf100[4-6] where needed [dns] - 10https://gerrit.wikimedia.org/r/465143 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [13:07:59] ema: should I go with --^ or wait for your revert ? [13:08:05] before the authdns update I mean [13:08:22] <_joe_> elukey: I think it's safe to do [13:08:33] it doesn't technically matter, but may as well not reload everything twice and just push it together [13:08:39] <_joe_> it won't cause so many connections to actually switch immeiately [13:08:47] <_joe_> but that ^^ [13:09:06] yep yep I know that those are not related just wanted to know if we wanted to couple the two changes :) [13:09:10] depooling now if we're all ok with that [13:09:14] +1 [13:09:17] <_joe_> as we know, confd needs a restart to pick up the change, meh [13:09:19] !log depool eqiad front-edge traffic T201039 [13:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:22] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [13:09:23] <_joe_> ema: +1 [13:09:42] (03CR) 10Ema: [C: 032] Revert "Revert "traffic: Depool eqiad from user traffic for switchover"" [dns] - 10https://gerrit.wikimedia.org/r/465161 (https://phabricator.wikimedia.org/T201039) (owner: 10Ema) [13:09:46] (03PS3) 10Ema: Revert "Revert "traffic: Depool eqiad from user traffic for switchover"" [dns] - 10https://gerrit.wikimedia.org/r/465161 (https://phabricator.wikimedia.org/T201039) [13:10:04] <_joe_> ema: lemme know when you're done [13:10:30] elukey: merging your 
changes too [13:10:35] ack! [13:10:57] (03CR) 10Muehlenhoff: [C: 031] "Looks good, manual test on elastic1017 was also fine." [puppet] - 10https://gerrit.wikimedia.org/r/465159 (owner: 10Gehel) [13:11:05] _joe_: authdns-update done [13:11:12] it takes 10 mins for the depool to be effective for clients (at least, the majority of them, but not the RFC-violating ones :P) [13:11:14] <_joe_> ema: thanks! [13:11:54] <_joe_> !log purging the dnsrec cache for eqiad,esams etcd client SRV records [13:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:48] (03CR) 10Ema: [C: 032] cache: enable varnish::logging::backendtiming [puppet] - 10https://gerrit.wikimedia.org/r/465136 (https://phabricator.wikimedia.org/T131894) (owner: 10Ema) [13:12:56] (03PS2) 10Ema: cache: enable varnish::logging::backendtiming [puppet] - 10https://gerrit.wikimedia.org/r/465136 (https://phabricator.wikimedia.org/T131894) [13:13:27] <_joe_> oh we will have the usual mediawiki alerts when we switch etcd [13:14:18] _joe_: I don't think so this time, it should be just a warning I think [13:14:20] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1221 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:14:20] PROBLEM - MediaWiki EtcdConfig up-to-date on mwdebug1001 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:14:30] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1347 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:14:49] _joe_ wins [13:14:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1348 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:14:50] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1264 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:11] <_joe_> volans: somehow the "master one" is still on codfw [13:15:19] mmmh [13:15:29] PROBLEM - MediaWiki EtcdConfig up-to-date 
on mw1285 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:29] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1273 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:30] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1265 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:30] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1325 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:30] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1222 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:30] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1230 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1244 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:15:49] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1251 is CRITICAL: etcd last index (1598) is outdated compared to the master one (302926) [13:16:17] <_joe_> volans: I assume that value gets cached? 
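The alert text above compares a host's last-seen etcd index with a reference "master" index. A minimal Python sketch of that comparison follows, assuming exact equality is required for OK. `check_etcd_index` is a hypothetical function, not the real Icinga plugin, which may apply thresholds or a warning state not visible in this log; the UNKNOWN branch for a local index ahead of the reference is purely an assumption.

```python
def check_etcd_index(local_index, master_index):
    """Compare a host's etcd index against the reference (master) index.

    Hypothetical sketch: returns (status, message) in the style of the
    alerts in this log. The UNKNOWN branch is an assumption; the log
    never shows a local index ahead of the master one.
    """
    if local_index == master_index:
        return ("OK", "etcd last index ({}) matches the master one ({})".format(
            local_index, master_index))
    if local_index < master_index:
        return ("CRITICAL",
                "etcd last index ({}) is outdated compared to the master one ({})".format(
                    local_index, master_index))
    return ("UNKNOWN", "etcd last index ({}) is ahead of the master one ({})".format(
        local_index, master_index))
```

Under this sketch, a reference value that still points at the old cluster (302926 here) makes every host on the new cluster (1598) look outdated until the reference is refreshed.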
[13:16:30] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1285 is OK: etcd last index (1598) matches the master one (1598) [13:16:30] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1221 is OK: etcd last index (1598) matches the master one (1598) [13:16:34] yes on file but there is a timer [13:16:37] that runs and update it [13:16:38] cat /var/run/icinga/etcd_mw_config_lastindex_eqiad [13:16:38] 1598 [13:16:39] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1273 is OK: etcd last index (1598) matches the master one (1598) [13:16:39] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1265 is OK: etcd last index (1598) matches the master one (1598) [13:16:39] RECOVERY - MediaWiki EtcdConfig up-to-date on mwdebug1001 is OK: etcd last index (1598) matches the master one (1598) [13:16:39] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1325 is OK: etcd last index (1598) matches the master one (1598) [13:16:40] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1222 is OK: etcd last index (1598) matches the master one (1598) [13:16:40] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1230 is OK: etcd last index (1598) matches the master one (1598) [13:16:49] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1347 is OK: etcd last index (1598) matches the master one (1598) [13:16:50] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1244 is OK: etcd last index (1598) matches the master one (1598) [13:16:50] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1251 is OK: etcd last index (1598) matches the master one (1598) [13:17:00] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1348 is OK: etcd last index (1598) matches the master one (1598) [13:17:00] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1264 is OK: etcd last index (1598) matches the master one (1598) [13:17:02] <_joe_> volans: ok, it was updated :P [13:18:21] <_joe_> volans: sadly we can't avoid this alert from firing when we switch to another etcd cluster [13:18:30] PROBLEM - Logstash rate of ingestion percent change 
compared to yesterday on einsteinium is CRITICAL: 133.5 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [13:18:57] _joe_: do we plan to do this often? [13:19:03] <_joe_> no [13:19:19] <_joe_> like once a semester, tops [13:19:43] then I guess it's acceptable to have ~10 spam alarms [13:20:28] <_joe_> yes [13:20:29] _joe_: the way to reduce it is to manually force a run on einsteinium right after changing the DNS and wiping the cache [13:20:34] <_joe_> yeah [13:21:07] sudo systemctl start update-etcd-mw-config-lastindex.service [13:21:09] should do it [13:21:27] it runs every 30 seconds now [13:21:30] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 919.90 seconds [13:21:42] ^ That was me [13:21:57] I already muted, so I have to check why it showed up [13:22:45] banyek: nothing looks downtimed on dbstore1002 apparently [13:24:14] I downtimed it now, I just selected the pull down, but didn't click submit :/ [13:32:40] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:32:41] !log restart confd on cp1* to pick up new srv records [13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:05] (03PS1) 10Filippo Giunchedi: logstash: add ipv6 to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465164 (https://phabricator.wikimedia.org/T206454) [13:36:07] (03PS1) 10Filippo Giunchedi: logstash: move to /srv/elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) [13:36:09] (03PS1) 10Filippo Giunchedi: WIP: new Kafka cluster logging-main [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) [13:36:12] (03PS1) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [13:36:50]
PROBLEM - DPKG on vega is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:37:35] !log restart confd on all the other eqiad nodes to pick up new srv records [13:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:49] (03CR) 10jerkins-bot: [V: 04-1] site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [13:37:59] RECOVERY - DPKG on vega is OK: All packages OK [13:39:15] !log Enable gtid on the following slaves: db2068 db1122 db1117:3323 [13:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:19] jynus banyek ^ [13:39:42] thanks [13:41:41] !log restart navtiming.service on webperf1001 to pick up the dns change for etcd [13:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:43] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [13:43:23] !log restart confd on esams nodes to pick up new srv settings [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:27] 10Operations, 10Traffic: Provide a Let's Encrypt ACME v2 staging environment account - https://phabricator.wikimedia.org/T206461 (10Vgutierrez) p:05Triage>03Normal [13:44:27] (03CR) 10DCausse: [C: 031] elasticsearch: mjolnir daemons can be autorestarted [puppet] - 10https://gerrit.wikimedia.org/r/465159 (owner: 10Gehel) [13:45:32] (03PS7) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [13:51:44] * Krenair pokes wikibugs [13:53:55] (03PS4) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters [debs/prometheus-blazegraph-exporter] - 
10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) [13:53:55] (03PS40) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [13:53:55] (03CR) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [13:53:55] (03CR) 10Mathew.onipe: [C: 031] elasticsearch: mjolnir daemons can be autorestarted [puppet] - 10https://gerrit.wikimedia.org/r/465159 (owner: 10Gehel) [13:53:55] (03PS2) 10Filippo Giunchedi: WIP: new Kafka cluster logging-main [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) [13:53:55] (03PS2) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [13:54:03] woah [13:54:11] (03PS2) 10Gehel: elasticsearch: mjolnir daemons can be autorestarted [puppet] - 10https://gerrit.wikimedia.org/r/465159 [13:54:18] okay looks like my IRC connection is slow today [13:54:52] I believe wikibugs is slow today Krenair [13:55:06] right [13:55:31] (03CR) 10Gehel: [C: 032] elasticsearch: mjolnir daemons can be autorestarted [puppet] - 10https://gerrit.wikimedia.org/r/465159 (owner: 10Gehel) [13:56:23] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1110 for testing s3 imports" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465168 [13:56:29] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1110 for testing s3 imports" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465168 [13:58:37] (03CR) 10Jcrespo: [C: 031] Revert "mariadb: Depool db1110 for testing s3 imports" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465168 (owner: 10Jcrespo) [13:59:48] (03PS1) 10Ema: cache: fix varnishbackendtiming destination 
directory [puppet] - 10https://gerrit.wikimedia.org/r/465170 (https://phabricator.wikimedia.org/T131894) [14:00:19] zuul broken, is someone working on that? [14:01:30] (03CR) 10Banyek: [C: 032] Revert "mariadb: Depool db1110 for testing s3 imports" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465168 (owner: 10Jcrespo) [14:03:38] oh, apparently has been broken for 3 days, so not new, just some downtime expired [14:03:43] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T184805: Revert 'mariadb: Depool db1110 for testing s3 imports' (duration: 00m 56s) [14:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:48] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [14:04:03] ^ I have to redo it, as I forgot to fetch/rebase [14:04:12] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1110 for testing s3 imports" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465168 (owner: 10Jcrespo) [14:04:17] (03PS5) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) [14:05:10] (03PS6) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) [14:05:26] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T184805: Revert 'mariadb: Depool db1110 for testing s3 imports' (duration: 00m 57s) [14:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:21] (03CR) 10Gehel: "@Stas: I'm not entirely happy with the added complexity here, but I think it does the job. 
It still needs a related change to the blazegra" [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [14:14:21] semi sync master enabled on s1 (db1067) [14:14:26] semi sync master enabled on s2 (db10676) [14:14:30] semi sync master enabled on s2 (db1066) [14:15:05] semi sync master enabled on s3 (db1070) [14:15:31] semi sync master enabled on s4 (db1068) [14:16:11] <_joe_> without !log that won't get recorded [14:16:38] not sure what he is trying to do, but masters should have that enabled already [14:16:41] (03PS3) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [14:18:12] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:18:32] (03CR) 10Mathew.onipe: wdqs: cleanup logback configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [14:19:16] (03PS1) 10Vgutierrez: secret: Add dummy LE ACMEv2 staging private key [labs/private] - 10https://gerrit.wikimedia.org/r/465171 (https://phabricator.wikimedia.org/T206461) [14:19:29] (03PS3) 10Filippo Giunchedi: WIP: new Kafka cluster logging-main [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) [14:19:31] (03PS4) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [14:20:54] (03CR) 10Alex Monk: [C: 04-1] "needs rebase and ferm::service allowing port 22 in from certcentral hosts" [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [14:21:51] (03PS7) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 
(https://phabricator.wikimedia.org/T200563) [14:22:06] (03CR) 10Gehel: wdqs: cleanup logback configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [14:23:21] (03CR) 10Ema: [C: 032] cache: fix varnishbackendtiming destination directory [puppet] - 10https://gerrit.wikimedia.org/r/465170 (https://phabricator.wikimedia.org/T131894) (owner: 10Ema) [14:23:57] https://www.irccloud.com/pastebin/a8pk7Qr3/ [14:24:05] (03PS1) 10Ema: prometheus: ATS aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/465172 (https://phabricator.wikimedia.org/T202381) [14:24:11] (03PS10) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) [14:24:23] (03PS1) 10Marostegui: Revert "mariadb: Switch db1122 binlog format to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/465173 [14:25:08] 10Operations, 10Wikimedia-Mailing-lists: Request for a mailing list for VVIT WikiConnect - https://phabricator.wikimedia.org/T191702 (10Dzahn) 05Open>03Resolved I added the new address you provided and the original secondary admin, so now you have "Vvitwikiconnect list run by kcvelaga at wikipedia.de, kcv... 
[14:27:28] (03CR) 10Filippo Giunchedi: "PCC is happy (for the whole chain of changes, including moving /var/lib/elasticsearch to /srv/elasticsearch)" [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [14:27:47] (03Abandoned) 10Marostegui: Revert "mariadb: Switch db1122 binlog format to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/465173 (owner: 10Marostegui) [14:29:22] (03PS1) 10Marostegui: db1122.yaml: Remove STATEMENT based [puppet] - 10https://gerrit.wikimedia.org/r/465174 [14:30:00] (03PS1) 10Muehlenhoff: Update db cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/465175 [14:30:44] (03CR) 10Ema: [C: 031] secret: Add dummy LE ACMEv2 staging private key [labs/private] - 10https://gerrit.wikimedia.org/r/465171 (https://phabricator.wikimedia.org/T206461) (owner: 10Vgutierrez) [14:32:03] (03CR) 10Jcrespo: [C: 031] db1122.yaml: Remove STATEMENT based [puppet] - 10https://gerrit.wikimedia.org/r/465174 (owner: 10Marostegui) [14:32:13] (03CR) 10Marostegui: [C: 032] db1122.yaml: Remove STATEMENT based [puppet] - 10https://gerrit.wikimedia.org/r/465174 (owner: 10Marostegui) [14:32:15] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wiktionary.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [14:33:56] (03PS1) 10Jayprakash12345: Enable Extension:File exporter to mrwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 [14:35:28] (03PS11) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) [14:35:29] <_joe_> sigh [14:35:33] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe 
Lavagetto) [14:37:22] (03PS2) 10Jayprakash12345: Enable Extension:File exporter to mrwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 (https://phabricator.wikimedia.org/T206437) [14:37:37] 10Operations, 10Wikimedia-Site-requests, 10Commons: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751 (10Reedy) [14:38:39] (03CR) 10Vgutierrez: [C: 032] secret: Add dummy LE ACMEv2 staging private key [labs/private] - 10https://gerrit.wikimedia.org/r/465171 (https://phabricator.wikimedia.org/T206461) (owner: 10Vgutierrez) [14:38:53] (03CR) 10Vgutierrez: [V: 032 C: 032] secret: Add dummy LE ACMEv2 staging private key [labs/private] - 10https://gerrit.wikimedia.org/r/465171 (https://phabricator.wikimedia.org/T206461) (owner: 10Vgutierrez) [14:39:03] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10ema) We now expose the following metrics to Prometheus: ``` 14:35:31 ema@cp2004.codfw.wmnet:~ $ curl -s http://localhost:3904/metrics... 
[14:39:34] (03CR) 10Filippo Giunchedi: "At merge time the idea is to disable puppet on the affected nodes, and one by one:" [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [14:40:39] (03PS4) 10Filippo Giunchedi: New Kafka cluster logging-main [puppet] - 10https://gerrit.wikimedia.org/r/465166 (https://phabricator.wikimedia.org/T206454) [14:40:41] (03PS5) 10Filippo Giunchedi: site: enable logging Kafka on Logstash nodes [puppet] - 10https://gerrit.wikimedia.org/r/465167 (https://phabricator.wikimedia.org/T206454) [14:45:21] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for coal [puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) [14:46:00] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for coal [puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:46:37] 10Operations, 10Wikimedia-Site-requests, 10Commons: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T192751 (10Reedy) 05Open>03Resolved a:03Reedy ``` reedy@mwmaint2001:/tmp/uploads$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=고려 /tmp/uploads I... [14:49:01] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for coal [puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) [14:56:59] 10Operations, 10Traffic, 10Patch-For-Review: Provide a Let's Encrypt ACME v2 staging environment account - https://phabricator.wikimedia.org/T206461 (10Vgutierrez) private key committed into our private repo. 
[14:57:19] 10Operations, 10Traffic: Provide a Let's Encrypt ACME v2 staging environment account - https://phabricator.wikimedia.org/T206461 (10Vgutierrez) [14:58:35] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) I re-examined the problem from a fresh start, and also tried to validate Joe... [15:00:23] (03PS9) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) [15:00:25] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert donate.w.o [puppet] - 10https://gerrit.wikimedia.org/r/462479 (https://phabricator.wikimedia.org/T196968) [15:00:27] (03PS4) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikinews.org [puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) [15:00:29] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikisource.org [puppet] - 10https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) [15:00:31] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikibooks.org [puppet] - 10https://gerrit.wikimedia.org/r/462487 (https://phabricator.wikimedia.org/T196968) [15:00:33] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert vote.w.o [puppet] - 10https://gerrit.wikimedia.org/r/462492 (https://phabricator.wikimedia.org/T196968) [15:00:35] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462493 (https://phabricator.wikimedia.org/T196968) [15:00:37] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/462494 (https://phabricator.wikimedia.org/T196968) [15:00:39] (03PS2) 10Giuseppe 
Lavagetto: mediawiki::web::prod_sites: convert wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/462495 (https://phabricator.wikimedia.org/T196968) [15:01:54] * elukey hides [15:06:27] (03PS2) 10Ema: prometheus: ATS aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/465172 (https://phabricator.wikimedia.org/T202381) [15:09:46] (03PS3) 10Ema: prometheus: ATS aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/465172 (https://phabricator.wikimedia.org/T202381) [15:12:46] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: ATS aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/465172 (https://phabricator.wikimedia.org/T202381) (owner: 10Ema) [15:13:21] (03CR) 10Ema: [C: 032] prometheus: ATS aggregation rules [puppet] - 10https://gerrit.wikimedia.org/r/465172 (https://phabricator.wikimedia.org/T202381) (owner: 10Ema) [15:18:22] 10Operations, 10Traffic, 10Patch-For-Review: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (10ema) [15:18:44] 10Operations, 10Traffic, 10Patch-For-Review: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (10ema) 05Open>03Resolved [15:23:36] (03CR) 10Giuseppe Lavagetto: [C: 031] "Puppet compiler at https://puppet-compiler.wmflabs.org/compiler1002/12813/mwdebug2002.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:26:04] (03CR) 10Jcrespo: [C: 04-1] "Aside from the mistake bellow, all ok. Maintaining roles may be impossible- we should be thinking on maintaining profile-based señectopm i" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465175 (owner: 10Muehlenhoff) [15:28:59] (03CR) 10WMDE-Fisch: [C: 04-1] "We're good to go, just on remark on the setting." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 (https://phabricator.wikimedia.org/T206437) (owner: 10Jayprakash12345) [15:29:24] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/462478 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:41:51] (03PS3) 10Jayprakash12345: Enable Extension:File exporter to mrwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 (https://phabricator.wikimedia.org/T206437) [15:42:24] 10Operations, 10Performance-Team, 10Traffic, 10Patch-For-Review, 10Wikimedia-Incident: Collect Backend-Timing in Prometheus - https://phabricator.wikimedia.org/T131894 (10Gilles) https://grafana.wikimedia.org/dashboard/db/apache-backend-timing getting something started there... [15:47:39] (03PS1) 10Filippo Giunchedi: WIP: define haproxy for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) [15:47:41] (03CR) 10WMDE-Fisch: [C: 031] Enable Extension:File exporter to mrwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 (https://phabricator.wikimedia.org/T206437) (owner: 10Jayprakash12345) [15:52:10] (03PS2) 10Filippo Giunchedi: WIP: define haproxy service for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) [15:52:48] (03CR) 10Filippo Giunchedi: "PCC is happy https://puppet-compiler.wmflabs.org/compiler1002/12816/" [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [16:11:26] (03CR) 10Elukey: "Two nits but it looks good!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464732 (https://phabricator.wikimedia.org/T206020) (owner: 10Nuria) [16:27:49] 10Operations, 10netops, 10Patch-For-Review: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Working with JTAC on this. 
Here is a tcpdump capture of a neighbor solicitation packet being sent from lvs1002: ``` lvs1002:~$ sudo tcpdump -p -i eth1... [16:28:46] !log restart eventlogging on eventlog1002 for python security upgrades [16:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:31] !log push firewall filter counters on asw2-b-eqiad - T201039 [16:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:35] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 [16:31:24] (03PS2) 10Elukey: Rotate logs in refinery based on time rather than size [puppet] - 10https://gerrit.wikimedia.org/r/464732 (https://phabricator.wikimedia.org/T206020) (owner: 10Nuria) [16:31:41] (03PS3) 10Elukey: Rotate logs in refinery based on time rather than size [puppet] - 10https://gerrit.wikimedia.org/r/464732 (https://phabricator.wikimedia.org/T206020) (owner: 10Nuria) [16:36:25] (03CR) 10Elukey: [C: 032] Rotate logs in refinery based on time rather than size [puppet] - 10https://gerrit.wikimedia.org/r/464732 (https://phabricator.wikimedia.org/T206020) (owner: 10Nuria) [16:45:34] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10mforns) @BBlack ping, bumping this up [16:47:12] (03CR) 10Gehel: [C: 031] "> Patch Set 1:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465165 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [16:48:10] (03CR) 10Gehel: [C: 04-1] prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [16:48:12] (03PS1) 10Gergő Tisza: Enable MCR read-new mode on some small wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) [16:51:52] (03CR) 10Daniel Kinzler: [C: 031] "We want this, and it looks correct." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) (owner: 10Gergő Tisza) [16:53:09] PROBLEM - High lag on wdqs2002 is CRITICAL: 3620 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:56:36] (03CR) 10Gehel: [C: 04-1] wdqs: auto deployment of wdqs on wdqs1009 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [16:58:41] 10Operations, 10netops, 10Patch-For-Review: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Followed up with JTAC, we can see the NS packets making it into the fabric: ``` # run show firewall Filter: v6-ns-lvs1002-ge-6/0/46.0-i... [16:58:50] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:09:30] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 46.98 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:14:59] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 77.18 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:16:00] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:24:55] 10Operations, 10SRE-Access-Requests, 10Wikimedia-Fundraising: Frdev1001 server and mysql access - 
https://phabricator.wikimedia.org/T206478 (10Krenair) pgehres hasn't worked here for years. anyway, adding this to #SRE-Access-Requests and shifting the fundraising project from subscribers to tags [17:31:06] (03PS1) 10Volans: etcd-config: add check for directory [puppet] - 10https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413) [17:31:09] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 79 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:55:37] (03CR) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) (owner: 10Mathew.onipe) [17:56:54] (03PS5) 10Mathew.onipe: prometheus-blazegraph-exporter: added Query and Concurrency related counters [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/464854 (https://phabricator.wikimedia.org/T206123) [18:00:05] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181008T1800) [18:00:05] Jayprakash12345 and tgr: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:45] I am here \o/ [18:00:46] o/ [18:01:53] anyone doing the SWAT? looks like jouncebot is not pinging deployers anymore [18:03:32] I'll do it then [18:07:19] uh, deployment.eqiad.wmnet points to deploy1001. Is that normal? [18:07:34] Alright, Wait for 5 minutes, then You should go ahead. [18:10:42] (03CR) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [18:11:03] mutante: around? 
[18:13:56] tgr: Go ahead [18:13:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:14:46] I'm trying to figure out which deployment host to use [18:16:16] greg-g: thcipriani: twentyafterfour: any of you around? [18:16:36] I guess US SWAT window on US holiday is not the best supported period [18:17:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:18:35] 10Operations, 10Analytics, 10Analytics-Cluster: Manage Hue via systemd unit - https://phabricator.wikimedia.org/T206484 (10MoritzMuehlenhoff) [18:19:45] (03PS8) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [18:20:31] (03CR) 10Gergő Tisza: [C: 032] Enable Extension:File exporter to mrwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 (https://phabricator.wikimedia.org/T206437) (owner: 10Jayprakash12345) [18:20:55] (03CR) 10BryanDavis: "> > Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [18:22:19] (03Merged) 10jenkins-bot: Enable Extension:File exporter to mrwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 (https://phabricator.wikimedia.org/T206437) (owner: 10Jayprakash12345) [18:23:11] (03CR) 10Muehlenhoff: [C: 031] "One nit, but looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413) (owner: 10Volans) [18:23:40] 10Operations, 10Analytics, 10Analytics-Cluster, 10User-Elukey: Manage Hue via systemd unit - https://phabricator.wikimedia.org/T206484 (10elukey) 
[18:23:51] Jayprakash12345: please test on mwdebug1002 [18:24:07] ok [18:24:42] (03PS2) 10Volans: etcd-config: add check for directory [puppet] - 10https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413) [18:24:46] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Krinkle) >>! In T191183#4648126, @Paladox wrote: > [..] using phabricator would not work seeing as the name of the file does not match the users username. Can you explain what you... [18:24:51] (03CR) 10Volans: "done" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465197 (https://phabricator.wikimedia.org/T199413) (owner: 10Volans) [18:25:58] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10MoritzMuehlenhoff) Ah, I missed the private.git commit, that's good enough. [18:26:24] tgr: Database locked [18:26:40] The database is currently locked to new entries and other modifications, probably for routine database maintenance, after which it will be back to normal. [18:26:41] The system administrator who locked it offered this explanation: You can't edit now. This is because of maintenance. Copy and save your text and try again in a few minutes. [18:27:09] tgr: at mwdebug1002 [18:27:10] uh [18:27:17] that's not scheduled, right? [18:27:43] means? [18:28:15] tgr: use hosts in codfw, as is the active datacenter [18:28:21] databases are RO in eqiad [18:28:26] tgr: Extention enable, I can see in Special:Version [18:28:51] volans: so deploy host is in eqiad but debug host should be in codfw? [18:28:56] But Can't access the beta-prefences [18:29:25] tgr: yes AFAIK, the deploy host can be switched independently of mediawiki and was not switched for this switchover [18:30:21] ok, thanks. Jayprakash12345, please use mwdebug2002 then. 
[18:32:20] (03CR) 10Muehlenhoff: Update db cumin aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465175 (owner: 10Muehlenhoff) [18:32:26] (03PS2) 10Muehlenhoff: Update db cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/465175 [18:33:07] tgr: Looks good. Run stashBot. [18:34:20] (03CR) 10jenkins-bot: Enable Extension:File exporter to mrwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465176 (https://phabricator.wikimedia.org/T206437) (owner: 10Jayprakash12345) [18:36:11] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:465176|Enable Extension:File exporter to mrwikipedia (T206437)]] (duration: 00m 57s) [18:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:16] T206437: Install Extension:File exporter to marathi wikipedia - https://phabricator.wikimedia.org/T206437 [18:36:24] Jayprakash12345: deployed [18:36:45] tgr: Thanks, Have a good day. [18:36:45] (03CR) 10Gergő Tisza: [C: 032] Enable MCR read-new mode on some small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) (owner: 10Gergő Tisza) [18:37:05] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@bd698bd]: WDQS test deployment - New federation whitelist entries(wdqs1009) [18:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:38] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@bd698bd]: WDQS test deployment - New federation whitelist entries(wdqs1009) (duration: 00m 33s) [18:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:27] uh [18:44:39] gate-and-submit success, but it did not get merged? [18:45:32] (03CR) 10Gergő Tisza: [C: 032] "Let's try that again, Zuul must have fallen asleep." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) (owner: 10Gergő Tisza) [18:45:45] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@bd698bd]: WDQS deployment - New federation whitelist entries [18:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:49] (03PS2) 10Gergő Tisza: Enable MCR read-new mode on some small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) [18:50:50] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [18:51:50] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [18:54:18] (03CR) 10Gergő Tisza: [C: 032] Enable MCR read-new mode on some small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) (owner: 10Gergő Tisza) [18:55:51] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@bd698bd]: WDQS deployment - New federation whitelist entries (duration: 10m 07s) [18:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:58] (03Merged) 10jenkins-bot: Enable MCR read-new mode on some small wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) (owner: 10Gergő Tisza) [19:00:10] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:465194|Enable MCR read-new mode on some small wikis (T198308)]] (duration: 00m 56s) [19:00:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:14] T198308: Enable MCR migration stage "write both, read new" on live systems - https://phabricator.wikimedia.org/T198308 [19:01:48] (03CR) 10jenkins-bot: Enable MCR read-new mode on some small wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/465194 (https://phabricator.wikimedia.org/T198308) (owner: 10Gergő Tisza) [19:04:09] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Paladox) @Krinkle would you know how to build this? Seeing as you can link your mediawiki profile or your ldap account, you could have a different name. [19:06:30] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Krinkle) >>! In T191183#4650347, @Paladox wrote: > @Krinkle would you know how to build this? Seeing as you can link your mediawiki profile or your ldap account, you could have a d... [19:08:48] !log depooling wdqs2003 to catch up on lag -T206423 [19:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:52] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423 [19:13:57] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Paladox) Ok, but would you know how to write this cgi script? 
[19:22:05] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[19:22:09] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[19:23:05] Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[19:29:01] bblack: ^
[19:41:50] !log troubleshooting asw2-b-eqiad with JTAC - T201039
[19:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:59] T201039: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039
[19:44:04] Operations, netops, Patch-For-Review: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi) Temporarily disable IGMP snooping on the interfaces to narrow down the issue. ```lang=diff [edit protocols igmp-snooping vlan all] + interface ge-6...
[19:52:04] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[19:52:08] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[19:53:05] Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[20:10:35] Operations, netops, Patch-For-Review: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi) >>! In T201039#4649384, @ema wrote: > cp1081 and cp1079, both on asw2-b-eqiad, are also having IPv6 connectivity issues with lvs1001: > I can ping th...
[20:13:05] C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[20:13:08] C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[20:13:12] C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[20:20:34] Operations, Core Platform Team Kanban (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Legoktm) Debian Buster is going to release with PHP 7.3 rather than 7.2 (per @MoritzMuehlenhoff and https://bugs.debi...
[20:42:26] !log repooling wdqs2003 caught up on lag - T206423
[20:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:31] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn - https://phabricator.wikimedia.org/T206423
[21:00:05] bawolff and Reedy: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181008T2100).
[21:01:55] Operations, Gerrit, Traffic, Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (Krinkle) Yes, I think anyone involved around this task could do it. It's not a question of how. The question is, what do we want for the user experience, and would it be worth it...
[21:18:49] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 145.1 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[21:23:12] (PS1) Zoranzoki21: Add throttle rule and remove outdated [mediawiki-config] - https://gerrit.wikimedia.org/r/465283 (https://phabricator.wikimedia.org/T206408)
[21:27:04] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[21:27:08] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[21:28:01] come on FB, not again...
[21:28:05] Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[21:29:22] XioNoX, those alerts are due to FB traffic?
[21:29:41] Krenair: yeah...
[21:29:55] Krenair: https://phabricator.wikimedia.org/T192688
[21:30:05] different kind of requests though iirc
[21:31:36] I can't view that but seriously
[21:32:20] * Reedy blames Facebook's new spy device
[21:33:02] they are saturating our peering port in Dallas
[21:33:25] How have they not yet been blocked?
[21:35:20] we managed to mitigate it previously by sending them a longer cache header when the request was coming from a specific IP/UA, but I think they changed their UA
[21:35:58] bblack looked into it last Friday
[21:36:21] to something more easily blockable or less easily blockable?
[21:37:08] just different, no difficulty change I think
[21:37:12] meh
[21:38:00] first alert was on a Saturday, then a Friday evening, and now twice on a US holiday...
[21:39:27] Krenair: that was the mitigation if you're curious: https://gerrit.wikimedia.org/r/c/operations/puppet/+/427821
[21:40:05] this is from april
[21:40:12] (PS1) Pmiazga: Beta: enable errors counting via statsv by Minerva skin [mediawiki-config] - https://gerrit.wikimedia.org/r/465284 (https://phabricator.wikimedia.org/T205582)
[21:40:16] it's just come back since april?
[21:40:22] yep
[21:40:38] and I take it we have no contact for this traffic
[21:40:44] (CR) Pmiazga: [C: 2] Beta: enable errors counting via statsv by Minerva skin [mediawiki-config] - https://gerrit.wikimedia.org/r/465284 (https://phabricator.wikimedia.org/T205582) (owner: Pmiazga)
[21:41:33] nope, that's what part of that task is about, communicating with Facebook :)
[21:42:15] well
[21:42:17] I reached out to their network team, that can only do things like setting up a direct link between our two networks for example
[21:42:20] good luck
[21:42:23] (Merged) jenkins-bot: Beta: enable errors counting via statsv by Minerva skin [mediawiki-config] - https://gerrit.wikimedia.org/r/465284 (https://phabricator.wikimedia.org/T205582) (owner: Pmiazga)
[21:43:02] sigh
[21:44:01] shouldn't a team like that be able to find out exactly who is sending the requests and why the volume of traffic is so high?
[21:46:36] who is easy, it's fb :P
[21:47:34] I also think there are people at Wiki in charge of relationships with them and others that should be able to help
[21:47:55] Get domas to fix it
[21:48:02] ^
[21:57:05] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[21:57:08] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[21:58:05] Critical Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
[21:58:43] (CR) jenkins-bot: Beta: enable errors counting via statsv by Minerva skin [mediawiki-config] - https://gerrit.wikimedia.org/r/465284 (https://phabricator.wikimedia.org/T205582) (owner: Pmiazga)
[22:18:04] C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[22:18:08] C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[22:18:12] C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[22:20:47] (PS1) Ayounsi: Revert "SNMP: set snmp-mibs-downloader BASEDIR to Debian 9 standard" [puppet] - https://gerrit.wikimedia.org/r/465290
[22:23:54] (CR) Ayounsi: [C: 2] Revert "SNMP: set snmp-mibs-downloader BASEDIR to Debian 9 standard" [puppet] - https://gerrit.wikimedia.org/r/465290 (owner: Ayounsi)
[22:24:03] (PS2) Ayounsi: Revert "SNMP: set snmp-mibs-downloader BASEDIR to Debian 9 standard" [puppet] - https://gerrit.wikimedia.org/r/465290
[22:41:27] !log clear BGP neighbor cr1-eqsin:AS9583 (bgp limit threshold reached)
[22:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:30] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[22:45:34] !log increase accepted-prefix-limit for 24115 on cr4-ulsfo
[22:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181008T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:28:49] Operations, Core Platform Team Kanban (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (MaxSem) I did some testing on 7.3 a while ago, besides a couple obvious simple bugs, there was just too much noise fro...
[23:55:22] * Krinkle staging on mwdebug2001