[00:21:52] PROBLEM - Device not healthy -SMART- on db2052 is CRITICAL: cluster=mysql device=cciss,1 instance=db2052:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw%2520prometheus%252Fops [00:32:37] (03PS27) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [00:33:20] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [00:37:23] (03CR) 10EBernhardson: "this patch is already getting too big, I'm going to split a few parts out that can be applied on their own ahead of time to simplify thing" [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [01:10:23] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 417.39 seconds [01:12:13] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 418.71 seconds [01:15:13] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.98 seconds [01:15:52] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.96 seconds [01:16:03] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.19 seconds [01:16:12] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.53 seconds [01:16:13] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.49 seconds [01:21:02] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.13 seconds [01:32:06] (03PS28) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [01:32:08] (03PS1) 10EBernhardson: Prep work for multi-instance elasticsearch 
refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 [01:33:07] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [01:33:13] (03CR) 10jerkins-bot: [V: 04-1] Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [02:12:02] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 383.44 seconds [02:18:32] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 20.93 seconds [03:20:23] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.32 seconds [03:20:32] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.20 seconds [03:20:42] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:20:43] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [03:21:22] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.45 seconds [03:21:22] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [03:25:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 710.49 seconds [03:28:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 266.74 seconds [05:02:16] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440500 [05:03:53] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440500 (owner: 10Marostegui) [05:05:22] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440500 (owner: 10Marostegui) [05:06:38] (03PS1) 
10Marostegui: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440501 (https://phabricator.wikimedia.org/T191316) [05:06:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1067 after alter table (duration: 01m 07s) [05:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:19] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1067" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440500 (owner: 10Marostegui) [05:09:55] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4284534 (10Marostegui) 05Resolved>03Open Another predictive failure for this host, the same disk, as it is an used disk, it is too surprising: ``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS,... [05:10:36] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440501 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:12:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440501 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:12:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440501 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [05:13:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1119 for alter table (duration: 00m 57s) [05:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:28] !log Deploy schema change on db1119 T191316 T192926 T89737 T195193 [05:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:34] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [05:13:34] T192926: Schema change to drop archive.ar_text and archive.ar_flags - 
https://phabricator.wikimedia.org/T192926 [05:13:35] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [05:13:35] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [05:34:37] !log installing gnupg security updates on trusty (Debian already fixed) [05:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:12] (03CR) 10Dzahn: [C: 032] DHCP: Change backup2001 MAC address from 1G MAC to 10G MAC [puppet] - 10https://gerrit.wikimedia.org/r/440485 (https://phabricator.wikimedia.org/T196477) (owner: 10Papaul) [05:50:06] !log slow rollout of debmonitor [05:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:51:23] (03PS4) 10Dzahn: DNS: Add production DNS entries for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/439965 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul) [05:52:03] (03CR) 10Dzahn: [C: 032] DNS: Add production DNS entries for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/439965 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul) [05:55:41] (03CR) 10Dzahn: [C: 04-1] "this would result in 5 digits in the name. like bast20010" [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [05:59:05] (03PS4) 10Dzahn: DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [06:01:00] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [06:01:34] (03CR) 10Dzahn: [C: 032] "fixed it. actually adding "2009 and 2010". what recipe are 2007 and 2008 using?" 
[puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [06:01:40] (03PS5) 10Dzahn: DHCP: Add MAC address and netboot entries for lvs2009 and lvs2010 [puppet] - 10https://gerrit.wikimedia.org/r/440360 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [06:12:53] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [06:13:04] (03CR) 10Giuseppe Lavagetto: [C: 031] mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 (owner: 10Krinkle) [06:16:30] (03PS2) 10Giuseppe Lavagetto: monitoring: Remove unused 'graphite_anomaly' command [puppet] - 10https://gerrit.wikimedia.org/r/437365 (owner: 10Krinkle) [06:17:14] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for PDNS recursor Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/437949 (https://phabricator.wikimedia.org/T135991) [06:17:32] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: Remove unused 'graphite_anomaly' command [puppet] - 10https://gerrit.wikimedia.org/r/437365 (owner: 10Krinkle) [06:21:43] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [06:27:17] (03PS5) 10Giuseppe Lavagetto: Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 (owner: 10Chad) [06:28:03] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/lib/apt/keys/ubuntucloud.gpg] [06:58:33] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:57] (03CR) 10Elukey: [C: 032] Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 (owner: 10Chad) [07:24:33] (03PS4) 10Ema: reload-vcl: add --separate-vcls [puppet] - 10https://gerrit.wikimedia.org/r/440342 (https://phabricator.wikimedia.org/T164609) [07:26:02] (03CR) 10Ema: [C: 032] reload-vcl: add --separate-vcls [puppet] - 10https://gerrit.wikimedia.org/r/440342 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [07:31:40] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609#4284629 (10ema) [07:34:18] (03PS3) 10Volans: admin: Port matrix.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/438116 (owner: 10Legoktm) [07:34:53] volans: :D [07:35:00] :-( [07:35:03] sorry [07:35:10] (03CR) 10Volans: [C: 032] admin: Port matrix.py to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/438116 (owner: 10Legoktm) [07:35:20] I completely forgot about that :-) done [07:42:51] <_joe_> lol [07:47:10] did we fix the python3 ops/puppet ci yet btw? [07:47:32] <_joe_> I don't know, I'll be honest [07:48:29] https://phabricator.wikimedia.org/T184435 is the task, seems still open [07:49:34] <_joe_> I am aware of the ticket, but I did nothing about it :( [07:49:54] paravoid: not that I know of [07:50:47] (03CR) 10Gehel: [C: 04-1] "Looks good, minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [07:51:10] How can one see how full our memcached cluster is? 
[07:52:00] addshore: https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats?orgId=1 has some info IIRC [07:52:31] I guess Current items is kind of the thing to look at, but that doesn't really say how full [07:52:33] <_joe_> addshore: 100% full [07:52:46] <_joe_> addshore: look at "evictions" [07:52:48] 49 million is a fair few entries :P [07:53:23] <_joe_> if you have more than 0 evictions, your cluster is somewhat full [07:53:47] gotcha [07:54:03] what determines level of fullness / when keys start being evicted? number of keys or disk? [07:54:14] <_joe_> disk is untouched [07:54:21] <_joe_> so the answer is somewhat complex [07:54:35] as always :) [07:54:40] <_joe_> memcached divides its available memory into slabs for objects of the same size [07:54:40] for the evictions also per slab is better, given that you might be full on one particular slab and have most evictions for those [07:54:58] <_joe_> volans: nowadays slabs are dynamically allocated [07:55:16] <_joe_> so whenever a slab is full, you start evicting there [07:55:32] Looking at https://wikitech.wikimedia.org/wiki/Memcached I see there is a mcc.php maint script [07:55:33] <_joe_> so as I was saying, slabs get enlarged and shrunk upon need [07:55:41] Is there a way to see how many keys exist for a given prefix somehow? [07:55:42] <_joe_> but that works up to a point [07:55:51] <_joe_> addshore: not that I know of [07:56:00] <_joe_> memcached has no querying capabilities [07:56:07] ack [07:56:09] <_joe_> you can dump all the keys on all 18 servers [07:56:16] that sounds large [07:56:28] Is that a bad idea or something that is fine to do? 
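[editor's note] The "more than 0 evictions" rule of thumb and the per-slab suggestion above can be sketched roughly as follows. This is a minimal illustration, not what anyone in the channel ran: it only parses the text replies that memcached's `stats` and `stats items` commands return, and the sample replies and their numbers (other than the 49 million items mentioned above) are made up. The function names are hypothetical.

```python
def parse_stats(raw):
    """Parse 'STAT <name> <value>' lines from a memcached stats reply."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def evictions_per_slab(raw_items):
    """Map slab class id -> evicted count from a 'stats items' reply."""
    per_slab = {}
    for name, value in parse_stats(raw_items).items():
        fields = name.split(":")  # e.g. items:12:evicted
        if len(fields) == 3 and fields[0] == "items" and fields[2] == "evicted":
            per_slab[int(fields[1])] = int(value)
    return per_slab


# Hypothetical sample replies; the eviction numbers are invented.
SAMPLE_STATS = "STAT curr_items 49000000\r\nSTAT evictions 1234\r\nEND\r\n"
SAMPLE_ITEMS = (
    "STAT items:5:number 1000\r\n"
    "STAT items:5:evicted 0\r\n"
    "STAT items:12:number 52000\r\n"
    "STAT items:12:evicted 1234\r\n"
    "END\r\n"
)

general = parse_stats(SAMPLE_STATS)
# Rule of thumb from the discussion: any evictions means "somewhat full".
is_full = int(general["evictions"]) > 0
# Slab classes that are actually evicting, as suggested for finer detail.
hot_slabs = {s: n for s, n in evictions_per_slab(SAMPLE_ITEMS).items() if n > 0}
print(is_full, hot_slabs)
```

In practice the raw text would come from each memcached server (or from the Prometheus exporter behind the Grafana dashboard linked above) rather than from hard-coded strings.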
[07:57:23] <_joe_> it is definitely a bad idea [07:57:41] I won't be doing that then :) [07:57:53] <_joe_> yeah not on a friday, even [07:57:58] <_joe_> ok, I'm out, bbl [07:58:03] o/ [07:58:11] <_joe_> elukey can answer more questions in my absence though :) [07:58:20] :D [07:59:13] elukey: the question is essentially, would doubling (roughly) the number of wikibase cache entries in memcached be a feasible idea. Relates to https://phabricator.wikimedia.org/T197252#4284642 [07:59:33] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#3882789 (10Legoktm) In your tox.ini, you can do something like ``` [testenv:flake8] commands = flake8 deps = flake8 basepython = pytho... [08:02:38] We need a phab admin to block https://phabricator.wikimedia.org/p/238482n375/ [08:05:38] !log rolling restart of elasticsearch eqiad for plugin upgrade - T194245 [08:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:43] T194245: Implement searching of 'depicts' on commons with the 'quantity' qualifier - https://phabricator.wikimedia.org/T194245 [08:05:48] addshore: my knowledge about memcached is a bit rusty, I would need to get some knowledge in my brain from swap first :D As first thought I'd check the size of the keys and their distribution, to get an idea about what slabs will be affected. The number of keys itself is a bit generic IIRC to get a precise statement.. 
it would also be good to know the number of those keys, just to have an idea [08:05:54] but it might be tricky [08:05:57] about the size [08:06:09] if you have patience I'll try to check this afternoon, but no promises :) [08:06:10] 10Operations, 10Puppet, 10AbuseFilter, 10Analytics-Kanban, and 13 others: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4285292 (10238482n375) p:05Normal>03Lowest a:05fgiunchedi>03None SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zL... [08:06:24] elukey: I'm in no particular rush :) [08:08:22] 10Operations, 10AbuseFilter, 10Analytics-Kanban, 10Commons, and 14 others: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#4285719 (10238482n375) p:05Triage>03Lowest a:05kaldari>03None SG9tZVBoYWJyaWNhdG9yCk5vIG1lc3NhZ2VzLiBObyBub3RpZmljYXRpb25zLg... [08:08:26] bawolff: yes, someone really needs to block that person [08:08:42] bawolff: why aren't you just a phab admin? [08:08:43] blocked now [08:08:47] :D [08:08:57] volans did the honours [08:09:21] I was thinking I should ask Andre about that when it's normal hours ;) [08:16:03] bawolff: I'm still getting notifications about that user doing stuff, but I guess that is just phab lag? 
[08:16:15] I guess so [08:16:21] * bawolff doesn't know much about phab [08:16:34] Yes [08:26:29] (03PS1) 10Elukey: profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 [08:27:17] (03PS2) 10Elukey: profile::hadoop::spark2: explicitly require hive client's config [puppet] - 10https://gerrit.wikimedia.org/r/440507 [08:29:09] 10Operations, 10monitoring, 10Patch-For-Review, 10User-Elukey: Configure puppetdb to export metrics via Prometheus JMX Agent - https://phabricator.wikimedia.org/T184796#4287892 (10Volans) p:05Lowest>03Normal a:03elukey [08:33:28] (03PS1) 10Ema: varnish::instance: pass -s argument to reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/440508 (https://phabricator.wikimedia.org/T164609) [08:35:17] now spammer: https://phabricator.wikimedia.org/T197406#4289005 ? [08:35:26] new* [08:35:55] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config, 10Security: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#4289349 (10Volans) [08:35:58] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config, 10Security: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#4289353 (10Volans) [08:36:00] (03PS2) 10Ema: varnish::instance: pass -s argument to reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/440508 (https://phabricator.wikimedia.org/T164609) [08:36:08] 10Operations, 10Puppet, 10Patch-For-Review, 10Security: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#4289357 (10Volans) [08:36:13] 10Operations, 10Puppet, 10Patch-For-Review, 10Security: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#4289362 (10Volans) p:05Lowest>03High a:03herron [08:36:14] ah, that was blocked too already [08:39:56] (03CR) 10Dzahn: [C: 032] Update group photo on people.wm.org 
[puppet] - 10https://gerrit.wikimedia.org/r/440467 (https://phabricator.wikimedia.org/T197268) (owner: 10Framawiki) [08:40:18] (03PS2) 10Dzahn: Update group photo on people.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/440467 (https://phabricator.wikimedia.org/T197268) (owner: 10Framawiki) [08:44:11] 10Operations, 10Patch-For-Review: Update people.wikimedia.org with the 2018 Wikimedia hackathon group photo - https://phabricator.wikimedia.org/T197268#4289419 (10Dzahn) 05Open>03Resolved Thanks! The photo has been updated. [08:45:20] 10Operations, 10Puppet, 10Goal, 10Security: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4289427 (10Volans) [08:46:11] 10Operations, 10Puppet, 10Security: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564#4289439 (10Volans) [08:49:08] 10Operations, 10hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#4289521 (10Dzahn) [08:49:31] 10Operations, 10hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3884434 (10Dzahn) [08:52:25] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for PDNS recursor Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/437949 (https://phabricator.wikimedia.org/T135991) [08:53:59] 10Operations, 10Puppet, 10Patch-For-Review, 10Security: Investigate landscape of PuppetDB Frontends and Provision One - https://phabricator.wikimedia.org/T184563#4289952 (10Volans) p:05Lowest>03Normal a:03Volans [08:54:01] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#4289958 (10Dzahn) p:05Lowest>03Normal a:03RobH [08:54:03] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#4289964 (10Dzahn) [08:54:52] 10Operations, 10Security: Update people.wikimedia.org with the 2017 Wikimedia hackathon group photo - 
https://phabricator.wikimedia.org/T184338#4289970 (10Dzahn) p:05Lowest>03Normal [08:55:16] 10Operations: Update people.wikimedia.org with the 2017 Wikimedia hackathon group photo - https://phabricator.wikimedia.org/T184338#4289977 (10Dzahn) [08:58:02] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for PDNS recursor Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/437949 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:58:18] 10Operations, 10Puppet, 10DBA: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#4289982 (10Dzahn) p:05Lowest>03Normal a:03jcrespo [09:00:25] 10Operations, 10Commons, 10Wikimedia-SVG-rendering, 10media-storage, 10Patch-For-Review: Install Noto fonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T184664#4289989 (10Dzahn) p:05Lowest>03Normal a:03kaldari [09:02:15] (03PS1) 10Muehlenhoff: Restrict prometheus-pdns-rec-exporter auto restart to jessie and later [puppet] - 10https://gerrit.wikimedia.org/r/440509 [09:03:06] (03CR) 10jerkins-bot: [V: 04-1] Restrict prometheus-pdns-rec-exporter auto restart to jessie and later [puppet] - 10https://gerrit.wikimedia.org/r/440509 (owner: 10Muehlenhoff) [09:03:56] !log deploy patch T197279 [09:03:57] (03PS2) 10Muehlenhoff: Restrict prometheus-pdns-rec-exporter auto restart to jessie and later [puppet] - 10https://gerrit.wikimedia.org/r/440509 [09:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:45] (03CR) 10jerkins-bot: [V: 04-1] Restrict prometheus-pdns-rec-exporter auto restart to jessie and later [puppet] - 10https://gerrit.wikimedia.org/r/440509 (owner: 10Muehlenhoff) [09:05:22] (03PS3) 10Muehlenhoff: Restrict prometheus-pdns-rec-exporter auto restart to jessie and later [puppet] - 10https://gerrit.wikimedia.org/r/440509 [09:05:32] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [09:05:40] (03PS1) 10Aklapper: Phabricator: Block vandalism IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/440510 [09:06:19] 10Operations, 10Puppet, 10Patch-For-Review, 10Security, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4290003 (10Volans) a:03fgiunchedi [09:07:18] (03CR) 10Muehlenhoff: [C: 032] Restrict prometheus-pdns-rec-exporter auto restart to jessie and later [puppet] - 10https://gerrit.wikimedia.org/r/440509 (owner: 10Muehlenhoff) [09:07:32] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4290015 (10Volans) [09:08:17] 10Operations, 10Mail, 10Security: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#4290021 (10Dzahn) [09:09:19] 10Operations, 10Mail: All IP addresses used for sending emails by Wikimedia's services - https://phabricator.wikimedia.org/T184555#4290035 (10Dzahn) [09:09:34] (03CR) 10Dnvjdvsj: [C: 04-1] "This is not a good solution, there's a proxies every where and i'll come back very soon" [puppet] - 10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [09:10:12] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [09:10:33] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:11:13] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [09:15:33] (03PS2) 10Dnvjdvsj: Phabricator: Block vandalism IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [09:16:03] RECOVERY - Disk space on furud is OK: DISK OK [09:20:46] !log fully remove ms-be1036 from swift due to hw 
failure - T196873 [09:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:51] T196873: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873 [09:26:25] 10Operations, 10Gadgets: test.wp shows the gadgets from test2.wp - https://phabricator.wikimedia.org/T197450#4290095 (10TheDJ) [09:29:28] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review, 10Security: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4290110 (10Vgutierrez) p:05Lowest>03Normal a:03Cmjohnson [09:30:31] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4290118 (10fgiunchedi) List of metrics at https://phabricator.wikimedia.org/P7262, I'll remove those if the list looks good. [09:32:45] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4290128 (10Pchelolo) The list is insanely long, I've poked around and didn't find anything that should remain. LGTM. [09:35:11] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4290145 (10Vgutierrez) [09:43:52] !log reducing temp. 
db2040 consistency to speed up slave lag catch up [09:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:17] (03PS3) 10Ema: varnish::instance: pass -s argument to reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/440508 (https://phabricator.wikimedia.org/T164609) [09:45:41] !log reenabling db2048 consistency after slaves caught up [09:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:57] (03CR) 10jerkins-bot: [V: 04-1] varnish::instance: pass -s argument to reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/440508 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [09:46:52] 10Operations, 10ops-esams, 10Security: To purchase for next esams visit - https://phabricator.wikimedia.org/T184522#4290193 (10Dzahn) [09:47:12] 10Operations, 10ops-esams: To purchase for next esams visit - https://phabricator.wikimedia.org/T184522#4290197 (10Dzahn) [09:47:29] (03PS4) 10Ema: varnish::instance: pass -s argument to reload-vcl [puppet] - 10https://gerrit.wikimedia.org/r/440508 (https://phabricator.wikimedia.org/T164609) [09:48:50] !log delete cpjobqueue metrics older than 10d - T196067 [09:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:55] T196067: Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067 [09:49:15] 10Operations, 10Analytics, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4290234 (10fgiunchedi) 05Open>03Resolved [09:49:22] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10Security: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#4290237 (10akosiaris) [09:50:49] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462#4290243 (10akosiaris) [09:51:00] (03PS3) 10Paladox: Phabricator: Block vandalism IP addresses [puppet] - 
10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [09:51:12] (03CR) 10Paladox: "Reverting spam" [puppet] - 10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [09:51:45] (03PS4) 10Dnvjdvsj: Phabricator: Block vandalism IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [09:54:32] (03PS2) 10Ema: cache::text: ship cache_misc VCL [puppet] - 10https://gerrit.wikimedia.org/r/440157 (https://phabricator.wikimedia.org/T164609) [09:55:05] Can somebody block that spammer on gerrit ^^ [09:55:22] bawolff: ^^ [09:55:30] Spammer is on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440510/ [09:55:54] done [09:56:18] Thanks [09:58:03] (03PS5) 10Paladox: Phabricator: Block vandalism IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [09:58:07] (03PS6) 10Aklapper: Phabricator: Block vandalism IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/440510 [10:03:07] (03PS4) 10Addshore: Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 (https://phabricator.wikimedia.org/T197454) [10:03:20] (03PS4) 10Addshore: Load WikibaseLexeme on testwiki (again again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 (https://phabricator.wikimedia.org/T197454) [10:03:32] (03PS4) 10Addshore: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) [10:03:47] (03PS4) 10Addshore: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) [10:04:09] (03PS5) 10Addshore: Load WikibaseLexeme on testwiki (again again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438005 (https://phabricator.wikimedia.org/T197454) [10:04:15] (03PS5) 10Addshore: Load WikibaseLexeme on all of group0 (again) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438006 
(https://phabricator.wikimedia.org/T197454) [10:04:21] (03PS5) 10Addshore: Load WikibaseLexeme on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436498 (https://phabricator.wikimedia.org/T195615) [10:04:26] (03PS5) 10Addshore: Load WikibaseLexeme on all wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436499 (https://phabricator.wikimedia.org/T195615) [10:06:55] (03PS5) 10Ema: varnish::instance: separate VCLs support [puppet] - 10https://gerrit.wikimedia.org/r/440508 (https://phabricator.wikimedia.org/T164609) [10:08:41] (03CR) 10Ema: "https://puppet-compiler.wmflabs.org/compiler02/11530/" [puppet] - 10https://gerrit.wikimedia.org/r/440508 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [10:09:43] 10Operations, 10Packaging, 10Patch-For-Review, 10Release: SCAP: Upload debian package version 3.7.5-1 - https://phabricator.wikimedia.org/T184774#4290385 (10akosiaris) p:05Lowest>03High a:03fgiunchedi [10:15:09] 10Operations, 10fundraising-tech-ops, 10netops: switch network port 2/0/3 (frdb1003) back to administration-vlan - https://phabricator.wikimedia.org/T184723#4290426 (10akosiaris) p:05Lowest>03Triage a:03ayounsi [10:17:48] 10Operations, 10ops-esams, 10Security: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184528#4290472 (10Vgutierrez) [10:18:04] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184528#4290475 (10Vgutierrez) [10:18:43] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [10:18:47] 10Operations, 10ops-eqiad, 10Security: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T184514#4290480 (10Vgutierrez) [10:18:55] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1033 - https://phabricator.wikimedia.org/T184514#4290483 (10Vgutierrez) [10:19:54] 10Operations, 10ops-esams, 10Security: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184533#4290486 (10Vgutierrez) [10:20:06] 
10Operations, 10ops-esams, 10Security: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184530#4290490 (10Vgutierrez) [10:20:14] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184533#4290493 (10Vgutierrez) [10:20:30] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T184530#4290495 (10Vgutierrez) [10:21:52] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:22:03] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:22:33] 10Operations, 10Analytics, 10hardware-requests: eqiad: (1) new stat box to offload users from stat1005 - https://phabricator.wikimedia.org/T196345#4290498 (10elukey) I had a chat with Ottomata and I think that the spare could work for the moment. The warranty will expire soonish so in case we'll see that a m... [10:22:43] PROBLEM - HP RAID on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[10:24:06] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: prometheus-blazegraph-exporter failing to start after reboot - https://phabricator.wikimedia.org/T184434#4290503 (10Vgutierrez) p:05Lowest>03Normal a:03fgiunchedi [10:24:16] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: prometheus-blazegraph-exporter failing to start after reboot - https://phabricator.wikimedia.org/T184434#4290510 (10Vgutierrez) [10:25:15] (03PS7) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [10:27:52] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440515 [10:28:47] (03PS8) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [10:28:49] mutante: get into the office? or? [10:29:22] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [10:29:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440515 (owner: 10Marostegui) [10:29:35] addshore: i got into 1st floor and then had to leaave again for an errand and then want to come back [10:30:09] okay! everyone is in a meeting currently, raz_WMDE want to look for you but didnt find you :D [10:30:46] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440515 (owner: 10Marostegui) [10:30:51] i am currently not there, trying to get my passport. 
but i will come back [10:30:54] thanks addshore [10:31:13] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1119" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440515 (owner: 10Marostegui) [10:32:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1119 after alter table (duration: 00m 58s) [10:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:12] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 61509 MB (12% inode=99%) [10:35:12] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4290582 (10Vgutierrez) [10:35:21] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4290585 (10Vgutierrez) [10:37:53] PROBLEM - IPMI Sensor Status on ms-be1034 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [10:38:02] (03PS7) 10Aklapper: Phabricator: Block vandalism IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/440510 [10:38:19] godog: ms-be1034 ^^^ [10:38:34] (03CR) 10Aklapper: [C: 04-1] "Garrrr, why the heck is ".gitreview" included here though I didn't use "--all"?" [puppet] - 10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [10:39:16] volans: thanks! 
I'll take a look [10:42:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#4290705 (10Legoktm) p:05Lowest>03Normal a:03RobH [10:42:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to analytics cluster for bawolff - https://phabricator.wikimedia.org/T184582#4290711 (10Legoktm) [10:44:02] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [10:44:34] (03PS9) 10ArielGlenn: allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) [10:49:23] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [10:51:53] RECOVERY - Disk space on elastic1020 is OK: DISK OK [10:52:43] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:00:53] PROBLEM - SSH on ms-be1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:01:22] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:01:52] RECOVERY - SSH on ms-be1035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [11:03:40] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4290888 (10Vgutierrez) I've just tested a new build of librdkafka (0.11.3-1~bpo8+1+wikimedia2) on cp1008 that includes the new TLS configuration... 
[11:05:53] 10Puppet, 10Beta-Cluster-Infrastructure, 10ORES, 10Scoring-platform-team (Current), and 2 others: Puppet broken on deployment-ores01 due to missing hieradata - https://phabricator.wikimedia.org/T184478#4290893 (10Ladsgroup) a:03Ladsgroup [11:07:43] RECOVERY - IPMI Sensor Status on ms-be1034 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [11:08:22] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.46 seconds [11:12:33] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [11:18:17] (03PS1) 10Vgutierrez: kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) [11:23:23] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [11:24:11] (03PS2) 10Vgutierrez: kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) [11:24:32] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [11:26:39] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler02/11532/" [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [11:29:12] 10Operations, 10Gadgets: test.wp shows the gadgets from test2.wp - https://phabricator.wikimedia.org/T197450#4290095 (10Legoktm) I'm randomly guessing that this could be related to the MCR stuff? https://lists.wikimedia.org/pipermail/wikitech-l/2018-June/090206.html For this to happen somehow the test2 messag... 
[11:36:42] PROBLEM - SSH on ms-be1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:38:34] 10Operations, 10MediaWiki-Cache: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450#4290095 (10Legoktm) I did a bit of debugging with eval.php, and this appears to affect all overridden system messages on test.wp - they're using the test2.wp version. [11:39:13] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [11:39:52] RECOVERY - SSH on ms-be1035 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [11:49:22] 10Operations, 10Developer-Relations, 10Discourse, 10Security: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#4291256 (10Ladsgroup) [11:49:23] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [11:49:40] 10Operations, 10Developer-Relations, 10Discourse: Discourse migration from wmflabs to production - https://phabricator.wikimedia.org/T184461#4291260 (10Ladsgroup) [11:51:17] (03PS4) 10Rduran: [WIP] Add unit tests for transfer.py and CumminExecution [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/437503 [11:52:42] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:55:22] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [11:56:22] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy [12:01:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:10:22] [{exception_id}] {exception_url} Wikimedia\Rdbms\DBQueryError from line 1443 of /srv/mediawiki/php-1.32.0-wmf.8/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema upda [12:10:26] exception ^ [12:11:03] starting at 11:35ish https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1529061367667&to=1529064510211 [12:12:17] on commons looks like [12:13:00] marostegui: could be db1119 repool? [12:13:02] !log OS install on bast2002 [12:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:43] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665#4291377 (10Papaul) [12:19:52] the errors seem to be of the type "Error: 1205 Lock wait timeout exceeded; try restarting transaction" [12:20:46] CC: jynus [12:22:16] (10.64.48.23, so db1068) [12:23:58] jynus: FYI, that long-running maintenance script is finished now. I don't have a timestamp though, so I don't know how close it came to my 16-hour estimate from yesterday. 
;) [12:24:01] ah, perhaps some background job [12:25:06] 10Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#4291395 (10jcrespo) This is now fixed on MySQL 8.0 https://dev.mysql.com/doc/refman/8.0/en/innodb-auto-increment-handling.html#innodb-auto-increment-initialization [12:25:53] anomie: I am disappointed you don't have micro-second precision [12:26:13] anomie: sadly, lag will continue for a while until everything catches up [12:27:10] I would also like you to discuss with performance at some point the impact on maintenance tasks when we go dual active-active dcs [12:27:38] I have some doubts how that is going to scale, I have pending to write a task/RFC on that [12:28:18] I forgot the ":-P" on my first line, hope the context was clear [12:30:02] !log reenabling db2040 consistency after slaves caught up [12:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:59] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450#4291412 (10Krinkle) [12:30:59] jynus: are the exceptions mentioned above somehow related to lag? [12:31:56] where? 
[12:32:23] < ema> starting at 11:35ish https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen&from=1529061367667&to=1529064510211 [12:32:30] < ema> the errors seem to be of the type "Error: 1205 Lock wait timeout exceeded; try restarting transaction" [12:32:37] < ema> (10.64.48.23, so db1068) [12:32:44] checking [12:33:58] it is the master not the replica [12:34:10] jobqueue [12:34:28] Lock wait timeout exceeded [12:34:39] Issues with HTMLCacheUpdateJob::invalidateTitles [12:36:47] it could be a cron job that is running now [12:37:14] <_joe_> jynus: uhm let me check [12:38:22] !log killing refresh counts on commonswiki [12:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:23] why not ping me earlier? [12:39:28] <_joe_> Pchelolo: I don't see an abnormal number of jobs on htmlCacheUpdate right now [12:39:56] I think that was the consequence, not the cause [12:40:16] <_joe_> jynus: cpjobqueue is processing 50 job/s across all wikis, that should be ok [12:40:25] <_joe_> but of course, maybe it's not [12:40:27] refresh counts is creating a lock on the master while running a count(*) [12:40:43] <_joe_> so the cronjob? [12:40:56] I am going to create a task [12:41:46] and I think it is anomie's fault, actually :-) [12:41:58] because it is not a cron :-) [12:42:18] (hope you don't take that seriously, anomie) [12:42:34] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450#4291453 (10Legoktm) a:03Legoktm Scratch that. It seems more likely that this was caused by the mcrouter deployment... [12:43:06] jynus: I've mentioned you at 12:20, maybe you didn't get the ping? 
[12:43:30] I didn't [12:43:32] sorry [12:44:03] 10Operations, 10ops-codfw: ms-be2023 fails to (re)boot - https://phabricator.wikimedia.org/T184785#4291471 (10Aklapper) p:05Lowest>03Triage [12:44:08] another ping masked it [12:44:15] ah [12:44:27] we are ok now (I think) [12:44:33] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:44:35] 10Operations, 10ops-eqiad: Hardware check on mw1271 - https://phabricator.wikimedia.org/T184722#4291514 (10Aklapper) p:05Lowest>03Triage [12:45:17] looks like :) [12:45:20] 10Operations, 10procurement: Give access to S4 (procurement tasks) to Erika Bjune - https://phabricator.wikimedia.org/T184617#4291586 (10Aklapper) p:05Lowest>03Triage [12:45:42] (03PS1) 10Legoktm: Make sure that mcrouter BagOStuff goes through ObjectCache::newFromParams() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440526 (https://phabricator.wikimedia.org/T197450) [12:46:42] jynus: Maybe an hour ago I manually ran Category::refreshCounts() on a bunch of categories to try to clean up after the deadlock bug. When I noticed my quick script to do that was pausing for too long on some big Commons categories, I killed it (although now that I think about it I don't know if that killed the DB queries, since PHP→MySQL can be dumb that way) and rewrote it to skip any categories that had too many rows. 
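A minimal sketch of the "skip any categories that had too many rows" idea from the 12:46:42 message, assuming the per-category row counts are already known. The function name `split_by_size`, the `category_sizes` mapping and the `max_rows` threshold are all hypothetical; the real script called Category::refreshCounts() from PHP and is not shown in the channel.

```python
# Hypothetical illustration only: names, sizes and the threshold are assumptions,
# not the script that was actually run against commonswiki.
def split_by_size(category_sizes, max_rows):
    """Partition categories into (to_refresh, skipped) by their row count.

    Categories with more than max_rows rows are skipped so that no single
    count query holds locks on the master for a long time.
    """
    to_refresh, skipped = [], []
    for name, rows in category_sizes.items():
        if rows <= max_rows:
            to_refresh.append(name)
        else:
            skipped.append(name)
    return to_refresh, skipped


sizes = {
    "Small category": 120,
    "Huge Commons category": 4_000_000,
    "Medium category": 9_500,
}
to_refresh, skipped = split_by_size(sizes, max_rows=10_000)
print("refresh:", to_refresh)
print("skipped:", skipped)
```

The skipped list is kept rather than discarded so the large categories can be handled separately, for example at a quieter time and after checking with the DBAs.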
[12:47:15] (03CR) 10jerkins-bot: [V: 04-1] Make sure that mcrouter BagOStuff goes through ObjectCache::newFromParams() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440526 (https://phabricator.wikimedia.org/T197450) (owner: 10Legoktm) [12:47:18] 10Operations, 10ops-codfw: ms-be2023 fails to (re)boot - https://phabricator.wikimedia.org/T184785#4291614 (10Aklapper) a:03fgiunchedi [12:47:47] 10Operations, 10ops-eqiad: Hardware check on mw1271 - https://phabricator.wikimedia.org/T184722#4291624 (10Aklapper) a:03Cmjohnson [12:47:58] https://phabricator.wikimedia.org/T195397#4291625 [12:48:16] anomie: please call me at that moment- it doesn't hurt to check [12:48:18] (03PS2) 10Legoktm: Make sure that mcrouter BagOStuff goes through ObjectCache::newFromParams() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440526 (https://phabricator.wikimedia.org/T197450) [12:48:47] good news is that I believe most impacted requests were jobqueue [12:49:01] I am checking how many non-jobqueue requests were impacted, if any [12:49:06] *now [12:49:15] <_joe_> jynus: <3 [12:49:30] so jobqueue on the receiving end this time :-) [12:49:48] <_joe_> and it's ok, it will retry [12:50:10] <_joe_> https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?orgId=1&from=now-3h&to=now&var-site=eqiad&var-type=htmlCacheUpdate [12:50:17] <_joe_> see the retry numbes [12:50:20] (03CR) 10Legoktm: [C: 032] Make sure that mcrouter BagOStuff goes through ObjectCache::newFromParams() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440526 (https://phabricator.wikimedia.org/T197450) (owner: 10Legoktm) [12:50:21] <_joe_> *numbers [12:50:52] these are cleaned up error numbers https://logstash.wikimedia.org/goto/12c11ff863504b8def5d22df42088718 [12:51:04] but that would include legitimate errors that are unrelated [12:51:22] 5000 job queue [12:51:27] around 1000 others [12:52:02] (03Merged) 10jenkins-bot: Make sure that mcrouter BagOStuff goes through ObjectCache::newFromParams() 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/440526 (https://phabricator.wikimedia.org/T197450) (owner: 10Legoktm) [12:52:20] (03CR) 10jenkins-bot: Make sure that mcrouter BagOStuff goes through ObjectCache::newFromParams() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440526 (https://phabricator.wikimedia.org/T197450) (owner: 10Legoktm) [12:52:37] is there a way to see edit numbers per wiki? [12:52:46] I guess recentchanges is the most direct [12:53:33] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [12:55:39] jynus: https://wikipulse.herokuapp.com/ [12:55:42] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [12:56:00] legoktm: I wanted historics [12:56:12] ah [12:56:15] I read too fast [12:56:18] like https://grafana.wikimedia.org/dashboard/db/edit-count [12:56:23] !log legoktm@deploy1001 Synchronized wmf-config/mc.php: Make sure that mcrouter BagOStuff goes through ObjectCache::newFromParams() - T197450 (duration: 00m 57s) [12:56:23] but per wiki or per section [12:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:27] T197450: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450 [12:56:51] I guess I should be the one implementing that [12:58:07] I don't see a huge amount of save failures, but I think at least some slowdown happened [13:01:32] 10Operations, 10AbuseFilter, 10Analytics-Kanban, 10DBA, and 12 others: Generate consistent logical database backups in CODFW - https://phabricator.wikimedia.org/T184699#4291756 (10Aklapper) a:03jcrespo [13:01:38] 10Operations, 10ops-eqiad, 10AbuseFilter, 10Analytics-Kanban, and 14 others: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4291755 (10Aklapper) a:03Cmjohnson [13:02:16] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - 
https://phabricator.wikimedia.org/T184787#4291773 (10Aklapper) p:05Lowest>03Normal [13:02:33] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4291794 (10Aklapper) p:05Lowest>03Normal [13:02:35] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665#4265037 (10Papaul) a:05Papaul>03Dzahn @Dzahn all yours [13:02:42] 10Operations, 10DBA, 10Goal: Generate consistent logical database backups in CODFW - https://phabricator.wikimedia.org/T184699#4291798 (10Aklapper) p:05Lowest>03Normal [13:03:00] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request for Tonina to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T184620#4291826 (10Aklapper) p:05Lowest>03Normal [13:03:49] I am not seeing a huge impact on edits [13:04:40] https://phabricator.wikimedia.org/P7264 [13:06:53] 10Operations, 10AbuseFilter, 10Analytics-Kanban, 10Data-release, and 13 others: Alert instrumentation returning 500 errors - https://phabricator.wikimedia.org/T184721#4291838 (10Aklapper) a:03ema [13:07:41] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#4291848 (10Aklapper) a:03elukey [13:07:49] 10Operations, 10ops-codfw, 10Patch-For-Review: mw2140 unresponsive, mgmt not accessible - https://phabricator.wikimedia.org/T184788#4291850 (10Aklapper) p:05Lowest>03High [13:08:02] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715#4291867 (10Aklapper) p:05Lowest>03High [13:08:08] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Alert instrumentation returning 500 errors - https://phabricator.wikimedia.org/T184721#4291865 (10Aklapper) p:05Lowest>03High [13:09:00] andre__: ping [13:09:17] 10Operations, 10Gadgets, 
10MediaWiki-Cache, 10Performance-Team, 10Patch-For-Review: test.wp is using test2.wp's message cache - https://phabricator.wikimedia.org/T197450#4291900 (10Legoktm) 05Open>03Resolved OK, so this is fixed, but some of the core messages are missing - if someone wants to do a fu... [13:09:23] Cyberpower678, pong? [13:09:44] andre__: any chance I could get the needed permissions to clean up some of the phab tickets? [13:09:58] Cyberpower678: Permissions for what? [13:10:24] andre__: unset the security policy, delete the spam comments, etc... [13:10:42] Cyberpower678: See https://www.mediawiki.org/wiki/Wikimedia_Security_Team/Policy/Access_To_Security_Issues [13:10:54] Deleting comments requires phab admin rights [13:11:28] (03PS8) 10Paladox: Phabricator: Block vandalism IP addresses [puppet] - 10https://gerrit.wikimedia.org/r/440510 (owner: 10Aklapper) [13:12:08] andre__: is it possible to gain temporary access to these rights. A bunch of my tickets are a real mess now. [13:12:21] see the link I provided [13:12:34] It's not that I'm in the Security team to decide. We're cleaning up. [13:14:48] (03CR) 10Legoktm: "Followup: Change-Id: I44fbaf222e5082188ae3cd12574367abdb41e651" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436252 (owner: 10Aaron Schulz) [13:14:59] andre__: I sent you a PM [13:16:42] Cyberpower678, yeah, so? :) [13:16:57] It's a task. Okay. Not sure what you're expecting from me. [13:17:03] Just wanted to share the ticket with you. :-) [13:17:13] Privately of course. 
:p [13:17:48] (03PS1) 10Elukey: Move the varnishkafka submodule to environments/production [puppet] - 10https://gerrit.wikimedia.org/r/440530 [13:18:35] (03CR) 10jerkins-bot: [V: 04-1] Move the varnishkafka submodule to environments/production [puppet] - 10https://gerrit.wikimedia.org/r/440530 (owner: 10Elukey) [13:18:42] yep yep yep [13:23:05] 10Operations, 10netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147#4291942 (10ayounsi) [13:24:19] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, but it's late friday, let's avoid merging on a Friday. Furthermore, next week is the SRE summit so let's schedule this upgrade for T" [puppet] - 10https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342) (owner: 10KartikMistry) [13:25:13] d [13:26:30] 10Operations, 10netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147#4291949 (10ayounsi) [13:28:17] (03PS7) 10Alexandros Kosiaris: Prep to tighten PuppetDB access control - log client certificate details [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:28:21] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Prep to tighten PuppetDB access control - log client certificate details [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:29:14] 10Operations, 10netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147#4291955 (10ayounsi) [13:29:49] 10Operations, 10Wikimedia-Mailing-lists: Description - https://phabricator.wikimedia.org/T184455#4291973 (10Aklapper) [13:30:55] <_joe_> akosiaris: on a friday before leaving for a week? [13:31:26] <_joe_> thanks, sir :P [13:32:06] yw [13:33:16] I see nothing but NONE in the logs whoever [13:33:18] however [13:34:01] 10.64.48.45 - - - NONE - - [15/Jun/2018:13:33:39 +0000] blah blah [13:34:20] so ssl_client_s_dn is - ? 
[13:35:28] hm NONE means no certificate was present, makes sense [13:35:43] <_joe_> so the masters don't sent the request with a client cert apparently [13:35:47] <_joe_> *send [13:35:56] yup [13:36:09] (03CR) 10Elukey: "pcc looks good https://puppet-compiler.wmflabs.org/compiler02/11533/" [puppet] - 10https://gerrit.wikimedia.org/r/440530 (owner: 10Elukey) [13:36:28] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Log line example" [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [13:38:22] (03CR) 10Alexandros Kosiaris: [C: 031] scap: add service name to restart on deploy [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/440368 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:45:49] (03PS2) 10Elukey: Move the varnishkafka submodule to environments/production [puppet] - 10https://gerrit.wikimedia.org/r/440530 [13:46:04] (03CR) 10Volans: [V: 032 C: 032] scap: add service name to restart on deploy [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/440368 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [13:46:11] (03CR) 10Elukey: [V: 032 C: 032] Move the varnishkafka submodule to environments/production [puppet] - 10https://gerrit.wikimedia.org/r/440530 (owner: 10Elukey) [13:47:56] <_joe_> elukey: it worked on the puppetmasters [13:48:11] running puppet on cp1008 [13:48:11] <_joe_> /var/lib/git/operations/puppet/modules/varnishkafka is no more [13:48:20] \o/ [13:48:51] vgutierrez: o/ - can i run puppet on pink unicorn [13:48:52] ? 
[13:49:10] "test librdkafka1_0.11.3-1~bpo8+1+wikimedia2_amd64.deb" [13:49:45] <_joe_> elukey: I'm running on cp1045 [13:49:53] <_joe_> worse that can happen is a compilation error [13:49:58] <_joe_> and no, it compiles [13:50:02] <_joe_> \o/ [13:50:08] niiiceeeeeeeeeee [13:50:15] <_joe_> we found a way to get away from submodules without pain [13:50:36] s/we/you but ok :D [13:51:14] this will help a lot the merge of the others submodules into ops/puppet [13:51:28] _joe_ shall we also try to move the module back into its place? [13:51:43] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2060 - https://phabricator.wikimedia.org/T184464#4292288 (10Aklapper) [13:51:55] 10Operations, 10Wikimedia-Mailing-lists: Emails to the mailing list Global-renamers are send, but not received by Hotmail users - https://phabricator.wikimedia.org/T184344#4292295 (10Aklapper) p:05High>03Lowest [13:52:33] <_joe_> elukey: sure [13:52:38] 10Operations, 10Wikimedia-Mailing-lists: Emails to the mailing list Global-renamers are sent, but not received by Hotmail users - https://phabricator.wikimedia.org/T184344#4292317 (10Nemo_bis) [13:52:57] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Security: Decommission db1030 - https://phabricator.wikimedia.org/T184397#4292319 (10Reedy) [13:53:35] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1030 - https://phabricator.wikimedia.org/T184397#4292326 (10Reedy) [13:53:44] (03PS2) 10Andrew Bogott: nova-api: allow access to port 8774 for api access [puppet] - 10https://gerrit.wikimedia.org/r/440478 [13:54:11] (03CR) 10Ottomata: kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [13:55:07] (03CR) 10Andrew Bogott: [C: 032] nova-api: allow access to port 8774 for api access [puppet] - 10https://gerrit.wikimedia.org/r/440478 (owner: 10Andrew Bogott) [13:56:08] (03PS1) 
10Elukey: Move the varnishkafka submodule back to its original place [puppet] - 10https://gerrit.wikimedia.org/r/440535 [13:57:48] (03PS2) 10Giuseppe Lavagetto: Move the varnishkafka submodule back to its original place [puppet] - 10https://gerrit.wikimedia.org/r/440535 (owner: 10Elukey) [13:58:53] _joe_ I am getting failures with pcc [14:00:30] ahahah lol https://puppet-compiler.wmflabs.org/compiler02/11535/ [14:00:38] [ 2018-06-15T13:58:47 ] INFO: Compilation failed for hostname cp1051.eqiad.wmnet in environment prod. [14:00:50] and then if completes correctly with a no op [14:02:52] <_joe_> ? [14:02:59] <_joe_> ok [14:03:04] <_joe_> let's go [14:03:06] (03CR) 10Andrew Bogott: [C: 04-2] "not ready yet, alas" [puppet] - 10https://gerrit.wikimedia.org/r/432703 (https://phabricator.wikimedia.org/T161553) (owner: 10Andrew Bogott) [14:03:12] (03CR) 10Giuseppe Lavagetto: [C: 032] Move the varnishkafka submodule back to its original place [puppet] - 10https://gerrit.wikimedia.org/r/440535 (owner: 10Elukey) [14:03:39] (03PS1) 10Alexandros Kosiaris: Fix a small comment typo [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440536 [14:04:16] <_joe_> elukey: let's do all of them? 
[14:05:01] (03CR) 10jerkins-bot: [V: 04-1] Fix a small comment typo [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440536 (owner: 10Alexandros Kosiaris) [14:05:04] _joe_ sure [14:05:10] <_joe_> elukey: let's first check the labs puppetmasters [14:05:17] <_joe_> something tells me they might not be as ok [14:05:30] yes git_sync might have not liked the changes [14:05:41] <_joe_> let's see about that [14:05:52] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1030 - https://phabricator.wikimedia.org/T184397#4292568 (10Aklapper) p:05Lowest>03Normal [14:09:05] re-enabled puppet where vk is used [14:09:37] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1030 - https://phabricator.wikimedia.org/T184397#4292685 (10Aklapper) a:03Cmjohnson [14:10:49] <_joe_> elukey: git-sync-upstream seems to have gone ok [14:10:59] <_joe_> it just had some untracked files in that directory [14:11:29] I checked labs-puppetmaster.wikimedia.org and modules/varnishkafka seems fine [14:11:43] nothing untracked [14:12:00] tested a noop change to /etc/varnishkafka/webrequest.conf on cp3008, all good [14:12:36] !log OS install on backup2001 [14:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:56] _joe_ prepping jmxtrans and kafkatee [14:14:03] last one will be nginx [14:14:31] (or better to do it after the offsite) [14:16:07] 10Operations, 10Dumps-Generation: Reboot snapshot*, dumpsdata*, dataset1001, ms1001, francium - https://phabricator.wikimedia.org/T184443#4292765 (10Aklapper) [14:16:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Upgrade cache_text to Varnish 5 - https://phabricator.wikimedia.org/T184448#4292761 (10Aklapper) [14:16:49] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#4292801 (10Aklapper) [14:18:14] _joe_ so a minor issue is that people pulling into their local repo will get some issues with untracked 
files [14:18:51] (Andrew just tried and reported the issue, but clearning the dir works as a charm) [14:18:59] *cleaning [14:19:18] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [14:19:44] I can send an email later on to ops@ as heads up [14:19:51] <_joe_> elukey: only if you pull after all is done and don't do a submodule update in the meanwhile [14:19:59] <_joe_> let's do all the others! [14:21:22] all including nginx? [14:21:44] <_joe_> elukey: or you do it next week at the summit [14:21:51] <_joe_> either thing is ok [14:22:06] we could do jmxtrans and kafkatee now [14:22:12] and the nginx monday after the offsite? [14:22:16] *then [14:22:38] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:25:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: Traceback (most recent call last) [14:25:37] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last) [14:26:18] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: Traceback (most recent call last) [14:26:22] (03PS1) 10Hashar: ci: add some gated extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/440539 (https://phabricator.wikimedia.org/T197469) [14:26:51] (03PS1) 10Elukey: Move jmxtrans and kafkatee submodules to environments/production [puppet] - 10https://gerrit.wikimedia.org/r/440540 [14:27:17] (03CR) 10Hashar: "I have not cherry picked that patch on the CI puppetmaster, I think the Docker slaves are tight on disk space so I dont want to have them " [puppet] - 10https://gerrit.wikimedia.org/r/440539 (https://phabricator.wikimedia.org/T197469) (owner: 10Hashar) [14:27:27] hmm, those seem like broken checks. 
looking [14:27:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: Traceback (most recent call last) [14:27:31] (03CR) 10jerkins-bot: [V: 04-1] Move jmxtrans and kafkatee submodules to environments/production [puppet] - 10https://gerrit.wikimedia.org/r/440540 (owner: 10Elukey) [14:28:19] (03CR) 10Volans: [V: 032 C: 032] "LGTM, thanks for the fix!" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/440536 (owner: 10Alexandros Kosiaris) [14:28:45] the ripe checks are failing because of 500 errors from the ripe API endpoint [14:29:29] ah [14:29:55] https://atlas.ripe.net/ itself returns 500 right now [14:30:13] _joe_ mind to check https://gerrit.wikimedia.org/r/440540 ? [14:30:13] herron: if you have a sec have a look at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/440365/ please ;) [14:30:18] p.s. morning :) [14:30:47] for sure and good afternoon to you! [14:30:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] Puppet agent: fix redirect to syslogs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [14:32:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 7 probes of 303 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:32:54] 10Operations, 10Analytics, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. 
- https://phabricator.wikimedia.org/T184551#4293105 (10Aklapper) [14:33:36] (03CR) 10Alexandros Kosiaris: [C: 031] Puppet agent: fix redirect to syslogs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [14:33:48] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 0 probes of 322 (alerts on 19) - https://atlas.ripe.net/measurements/11645085/#!map [14:33:48] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 0 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [14:33:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 6 probes of 303 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:35:29] (03CR) 10Volans: Puppet agent: fix redirect to syslogs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [14:36:12] 10Puppet, 10Analytics-Kanban, 10User-Elukey: analytics VPS project puppet errors - https://phabricator.wikimedia.org/T184482#4293128 (10Aklapper) [14:36:53] (03PS2) 10Giuseppe Lavagetto: Fix gemspec warnings [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/440293 [14:37:27] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix gemspec warnings [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/440293 (owner: 10Giuseppe Lavagetto) [14:39:32] 10Operations, 10Wikimedia-Mailing-lists: Create mailing list for Wikimedia Levant user group - https://phabricator.wikimedia.org/T184352#3880733 (10Aklapper) [14:43:16] !log restart varnishkafka-eventlogging on cp5012 as attempt to clear out the errors (not needed but logging it anyway) [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:43] 10Operations, 10Wikimedia-Mailing-lists: Create new mailing list BiblioWiki@ for Italian group of librarians - https://phabricator.wikimedia.org/T184438#3882910 
(10Aklapper) [14:45:16] 10Operations, 10Wikimedia-Mailing-lists: BilioWiki - https://phabricator.wikimedia.org/T184440#3882961 (10Aklapper) [14:45:19] 10Operations, 10Wikimedia-Mailing-lists: BiblioWiki - https://phabricator.wikimedia.org/T184441#3882978 (10Aklapper) [14:45:32] <_joe_> elasticsearch seems to be down for lvs1016 [14:45:36] 10Puppet, 10Beta-Cluster-Infrastructure, 10Services (done): Puppet disabled for a month on deployment-restbase0[12] instances - https://phabricator.wikimedia.org/T184477#3884325 (10Aklapper) [14:45:53] (03CR) 10Herron: [C: 031] "makes sense to me" [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [14:46:11] <_joe_> and it just recovered [14:46:27] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2037 - https://phabricator.wikimedia.org/T184390#4293222 (10Aklapper) [14:46:49] 10Operations, 10Deployments: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470#4293228 (10Dzahn) [14:47:30] 10Operations, 10Scap: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470#4293242 (10Dzahn) [14:48:43] 10Operations, 10Scap: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470#4293228 (10Dzahn) [14:49:30] 10Operations, 10Release-Engineering-Team, 10Scap: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470#4293258 (10Dzahn) [14:49:42] !log restart varnishkafka-eventlogging on cp4028, errors logged [14:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:11] (03PS3) 10Vgutierrez: kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) [14:54:56] 
(03CR) 10Vgutierrez: kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [14:59:10] (03PS2) 10Volans: Puppet agent: fix redirect to syslog [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) [15:12:21] (03CR) 10Ottomata: [C: 032] kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [15:12:27] (03PS4) 10Ottomata: kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [15:12:30] (03CR) 10Ottomata: [V: 032 C: 032] kafka: Ensure JVM AES intrinsics usage if AES ciphersuites are enabled [puppet] - 10https://gerrit.wikimedia.org/r/440520 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [15:13:48] (03CR) 10Dzahn: [C: 04-1] "even though noc and dbtree share a puppet class, doesnt mean the cache backends need to be switched at the same time. just move noc and le" [puppet] - 10https://gerrit.wikimedia.org/r/430527 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [15:14:11] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4293332 (10Papaul) @ayounsi @BBlack I am getting the network error message below during install on both lvs2009 and lvs2010. Please advice. Thanks.... [15:16:18] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4293344 (10Papaul) log on install2002 for lvs2010 DHCPDISCOVER from 00:0a:f7:f0:02:40 via 10.192.48.2 Jun 15 15:06:56 install2002 dhcpd[18272]: DHCPOFFER on 10.192.49.7 t... 
[15:18:20] (03PS2) 10Dzahn: cache::misc: switch noc.wm backend to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430527 (https://phabricator.wikimedia.org/T192092) [15:18:47] (03PS4) 10Nehajha: Read rcfile if it exists and parse arguments from it using configparser [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) [15:19:08] !log rolling restart of elasticsearch eqiad for plugin upgrade completed - T194245 [15:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:13] T194245: Implement searching of 'depicts' on commons with the 'quantity' qualifier - https://phabricator.wikimedia.org/T194245 [15:19:13] !log bouncing kafka broker on kafka-jumbo1001 to test https://gerrit.wikimedia.org/r/#/c/440520/ [15:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:39] (03PS1) 10Dzahn: noc/dbtree: require libapache-mod-php [puppet] - 10https://gerrit.wikimedia.org/r/440542 (https://phabricator.wikimedia.org/T192092) [15:26:18] (03CR) 10jerkins-bot: [V: 04-1] noc/dbtree: require libapache-mod-php [puppet] - 10https://gerrit.wikimedia.org/r/440542 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [15:26:54] (03CR) 10Dzahn: [C: 032] "for now still needs to support both terbium and mwmaint1001, hence the stretch and jessie support and php5... 
will be removed again once t" [puppet] - 10https://gerrit.wikimedia.org/r/440542 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [15:27:00] (03PS1) 10Anomie: Move CLI overrides after InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) [15:27:09] (03CR) 10Anomie: "This seems to have caused T197475" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393355 (owner: 10Reedy) [15:27:40] (03PS2) 10Dzahn: noc/dbtree: require libapache-mod-php [puppet] - 10https://gerrit.wikimedia.org/r/440542 (https://phabricator.wikimedia.org/T192092) [15:28:59] (03CR) 10Dzahn: [C: 032] noc/dbtree: require libapache-mod-php [puppet] - 10https://gerrit.wikimedia.org/r/440542 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [15:31:11] (03PS3) 10Dzahn: cache::misc: switch noc.wm backend to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430527 (https://phabricator.wikimedia.org/T192092) [15:34:53] jynus: FYI, I'll have to re-run that maintenance script for plwiki and ptwiki at some point, they errored out for some reason (unfortunately thanks to T197475 I don't know what that reason is, yet). Rough guess is about 8 hours to run both. 
[15:34:54] T197475: Wikimedia: Command-line scripts are saying to set $wgShowExceptionDetails - https://phabricator.wikimedia.org/T197475 [15:35:28] (03CR) 10Dzahn: [C: 032] cache::misc: switch noc.wm backend to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/430527 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [15:36:10] (03CR) 10Andrew Bogott: Read rcfile if it exists and parse arguments from it using configparser (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [15:37:35] !log switching noc.wikimedia.org site from terbium to mwamiant1001 backend, running puppet on all cache::misc cp servers (T192092) [15:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:40] T192092: setup replacement for terbium (maintenance_server) on stretch - https://phabricator.wikimedia.org/T192092 [15:37:51] anomie: maybe it is better to wait until tuesday [15:41:54] (03PS1) 10Vgutierrez: varnishkafka: Set TLS signature algorithms and curves lists [puppet] - 10https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993) [15:45:18] (03CR) 10Nehajha: ">" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [15:45:52] (03CR) 10Andrew Bogott: "No trouble :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [15:46:07] (03CR) 10Vgutierrez: "pcc is pleased with this CR: https://puppet-compiler.wmflabs.org/compiler02/11536/" [puppet] - 10https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993) (owner: 10Vgutierrez) [15:46:53] (03CR) 10Vgutierrez: [C: 04-2] "Wait till librdkafka_0.11.3-1~bpo8+1+wikimedia2 is deployed" [puppet] - 10https://gerrit.wikimedia.org/r/440544 (https://phabricator.wikimedia.org/T182993) (owner: 
10Vgutierrez) [15:47:28] (03CR) 10Andrew Bogott: "btw, it would be good to have an example/reference config file included in the docs someplace. Probably not as part of this patch though." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [15:47:48] 10Operations, 10Puppet, 10Patch-For-Review: Investigate landscape of PuppetDB Frontends and Provision One - https://phabricator.wikimedia.org/T184563#4293457 (10Aklapper) [15:49:09] (03CR) 10Nehajha: "> btw, it would be good to have an example/reference config file" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [15:49:48] !log install2002 - disabling puppet temp, live hackking DHCP config for debugging backup2001 install issue [15:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:25] (03PS5) 10Nehajha: Read rcfile if it exists and parse arguments from it using configparser [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) [15:57:41] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477#4293489 (10Papaul) @MoritzMuehlenhoff we are missing in our installer network drivers for the NIC card on this system (QLogic 10GE 2P QL41112HxCU-DE Adapter )... 
[16:02:26] (03CR) 10Alex Monk: "Interesting, it's working for me:" [puppet] - 10https://gerrit.wikimedia.org/r/437057 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:02:55] (03CR) 10Herron: [C: 032] "LGTM https://puppet-compiler.wmflabs.org/compiler02/11537/" [puppet] - 10https://gerrit.wikimedia.org/r/439451 (https://phabricator.wikimedia.org/T184244) (owner: 10Alex Monk) [16:03:00] (03PS3) 10Herron: Followup If545182a: Actually use cert_name now [puppet] - 10https://gerrit.wikimedia.org/r/439451 (https://phabricator.wikimedia.org/T184244) (owner: 10Alex Monk) [16:09:54] (03CR) 10Volans: [C: 032] Puppet agent: fix redirect to syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [16:09:59] (03PS3) 10Volans: Puppet agent: fix redirect to syslog [puppet] - 10https://gerrit.wikimedia.org/r/440365 (https://phabricator.wikimedia.org/T191300) [16:14:19] (03CR) 10Jcrespo: [C: 031] "Everything that is here is ok, I just don't know yet if everthing that is to be doen is here." [puppet] - 10https://gerrit.wikimedia.org/r/437720 (https://phabricator.wikimedia.org/T196376) (owner: 10Marostegui) [16:26:19] (03PS1) 10Dzahn: icinga/wikidata: add secondary dispatcher critical check [puppet] - 10https://gerrit.wikimedia.org/r/440549 [16:31:39] (03PS1) 10Dzahn: icinga/wikidata: fix regex for wikidata dispatcher check [puppet] - 10https://gerrit.wikimedia.org/r/440550 [16:33:00] (03PS2) 10Dzahn: icinga/wikidata: fix regex for wikidata dispatcher check [puppet] - 10https://gerrit.wikimedia.org/r/440550 [16:37:43] (03CR) 10Ladsgroup: [C: 031] ":D" [puppet] - 10https://gerrit.wikimedia.org/r/440550 (owner: 10Dzahn) [16:39:56] (03CR) 10Dzahn: [C: 032] icinga/wikidata: fix regex for wikidata dispatcher check [puppet] - 10https://gerrit.wikimedia.org/r/440550 (owner: 10Dzahn) [16:40:13] idk if this is the right place to ask but there's a strange problem with twinkle but only for 3rr? 
[16:45:52] (03PS12) 10Krinkle: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 [16:45:58] (03CR) 10Krinkle: "Rebased, again." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 (owner: 10Krinkle) [16:51:44] (03PS2) 10Dzahn: icinga/wikidata: add secondary dispatcher critical check [puppet] - 10https://gerrit.wikimedia.org/r/440549 [16:54:59] (03PS3) 10Dzahn: icinga/wikidata: add secondary dispatcher critical check [puppet] - 10https://gerrit.wikimedia.org/r/440549 [16:55:48] (03CR) 10Dzahn: "the definition of critical in this context would be like "a server admin is expected to do something asap"" [puppet] - 10https://gerrit.wikimedia.org/r/440549 (owner: 10Dzahn) [16:56:05] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477#4293662 (10Papaul) [17:01:09] (03CR) 10Dzahn: [C: 032] icinga/wikidata: add secondary dispatcher critical check [puppet] - 10https://gerrit.wikimedia.org/r/440549 (owner: 10Dzahn) [17:04:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [17:04:37] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [17:05:06] are those expected XioNoX? [17:05:31] schedules downtime for 'correctness of icinga config' adding a new check command.. [17:05:38] but will be quick [17:06:08] meh, just needed 2 puppet runs. 
not even worth it.done [17:07:05] herron: checking, but even if not expected we have redundancy [17:07:35] cool thx [17:08:05] (03CR) 10Krinkle: [C: 032] mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 (owner: 10Krinkle) [17:08:36] * Krinkle stages on mwdebug1002 [17:09:12] (03PS7) 10Krinkle: profiler-labs: Use FlameGraph-compatible format for xhprof sampler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/434522 (https://phabricator.wikimedia.org/T176916) [17:09:29] herron: btw, https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down [17:09:51] (03Merged) 10jenkins-bot: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 (owner: 10Krinkle) [17:09:56] nice! thanks reading through this now [17:10:05] (03CR) 10jenkins-bot: mc-labs: Sync with prod or document differences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437876 (owner: 10Krinkle) [17:11:04] herron: I can't find any planned maintenance. I'm on my phone, let me know if anything degrades [17:11:17] XioNoX: will do thanks [17:15:50] !log krinkle@deploy1001 Synchronized wmf-config/mc.php: I619a2ff5db611 (duration: 00m 58s) [17:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:08] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4293707 (10herron) Hello @MSantos! To provision access we need to list down the specific group memberships that are requested. Could you please coordinate the gathering of this infor... 
[17:17:13] (03PS1) 10Dzahn: icinga/wikidata: add new check_command for dispatcher [puppet] - 10https://gerrit.wikimedia.org/r/440552 [17:17:48] (03PS2) 10Dzahn: icinga/wikidata: add new check_command for dispatcher [puppet] - 10https://gerrit.wikimedia.org/r/440552 [17:18:52] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4293712 (10herron) p:05Triage>03Normal [17:18:58] (03CR) 10Dzahn: [C: 032] icinga/wikidata: add new check_command for dispatcher [puppet] - 10https://gerrit.wikimedia.org/r/440552 (owner: 10Dzahn) [17:20:49] (03PS1) 10Krinkle: logging: Raise minimum level for 'preferences' to INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440553 [17:32:38] * Krinkle done on mwlog1002 [17:32:45] * Krinkle testing stuff on mwlog1001 [17:57:54] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: dbtree broken (for some users?) - https://phabricator.wikimedia.org/T162976#4293777 (10jcrespo) [17:57:58] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#4293775 (10jcrespo) 05Open>03stalled This is stalled because tendril cannot work with multiple db backends. We would need to setup a different backend to support it- w... [17:58:01] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#4293778 (10jcrespo) [18:03:26] ACKNOWLEDGEMENT - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: Herron Telia Carrier Reference: 00862426. We regret to inform you that we are currently experiencing a major outage in New York. The issue is suspected to be caused by a fiber cut. 
[18:03:26] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: Herron Telia Carrier Reference: 00862426. We regret to inform you that we are currently experiencing a major outage in New York. The issue is suspected to be caused by a fiber cut. [18:10:29] !log reindexing Bosnian wikis on elastic@codfw (T196658) [18:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:32] T196658: Re-index Croatian, Serbo-Croatian, and Bosnian Wikis - https://phabricator.wikimedia.org/T196658 [18:19:22] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [18:20:45] (03PS1) 10Bmansurov: Enable logging for Schema:CitationUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440557 [18:21:03] (03PS2) 10Bmansurov: Enable logging for Schema:CitationUsage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440557 (https://phabricator.wikimedia.org/T191086) [18:21:34] (03CR) 10Bmansurov: [C: 04-1] "To be deployed on 6/21." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440557 (https://phabricator.wikimedia.org/T191086) (owner: 10Bmansurov) [18:22:41] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:23:44] (03CR) 10EBernhardson: Prep work for multi-instance elasticsearch refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [18:26:28] (03PS2) 10EBernhardson: Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 [18:26:30] (03PS29) 10EBernhardson: [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 [18:27:45] (03CR) 10jerkins-bot: [V: 04-1] Prep work for multi-instance elasticsearch refactor [puppet] - 10https://gerrit.wikimedia.org/r/440498 (owner: 10EBernhardson) [18:27:59] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Allow multiple elasticsearch instances per host [puppet] - 10https://gerrit.wikimedia.org/r/440049 (owner: 10EBernhardson) [18:45:12] RECOVERY - Memory correctable errors -EDAC- on scb1002 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops [18:46:31] (03CR) 10Andrew Bogott: [C: 032] Read rcfile if it exists and parse arguments from it using configparser [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [18:47:21] (03Merged) 10jenkins-bot: Read rcfile if it exists and parse arguments from it using configparser [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/437164 (https://phabricator.wikimedia.org/T148872) (owner: 10Nehajha) [18:48:52] !log reindexing Bosnian wikis on elastic@eqiad (T196658) [18:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:54] T196658: Re-index Croatian, Serbo-Croatian, and Bosnian Wikis - https://phabricator.wikimedia.org/T196658 [18:50:52] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#4293838 (10Krinkle) [18:51:08] 10Operations, 10DBA, 10Traffic, 
10Availability: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (10Krinkle) [19:05:58] (03PS2) 10Aaron Schulz: Make mediawiki.org write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440469 [19:39:21] (03PS1) 10Alex Monk: deployment-prep: Fix shinken check for Citoid [puppet] - 10https://gerrit.wikimedia.org/r/440561 [19:39:37] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (shinken-beta-citoid)$ git review [19:39:37] Problem running 'git remote update gerrit' [19:39:38] Fetching gerrit [19:39:39] fatal: internal server error [19:41:11] (03PS1) 10Alex Monk: shinkengen: Ignore instances that are turned off in Nova [puppet] - 10https://gerrit.wikimedia.org/r/440562 [19:46:32] (03CR) 10Alex Monk: "krenair@shinken-01:~$ /usr/lib/nagios/plugins/check_http -H deployment-sca02 -p 1970 -u /_info" [puppet] - 10https://gerrit.wikimedia.org/r/440561 (owner: 10Alex Monk) [19:48:50] (03CR) 10Alex Monk: "(think it's actually -I instead of -H but it still appears to work)" [puppet] - 10https://gerrit.wikimedia.org/r/440562 (owner: 10Alex Monk) [20:19:12] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [20:22:32] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:23:02] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [20:26:21] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [21:04:56] (03CR) 10Krinkle: Move CLI overrides after InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) (owner: 10Anomie) [21:19:12] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [21:22:32] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:45:39] (03PS1) 10Ladsgroup: snapshot: fix css used to show report cards [puppet] - 10https://gerrit.wikimedia.org/r/440613 [22:24:22] (03CR) 10Krinkle: [C: 04-1] Greatly simplify svn.wikimedia.org redirects [puppet] - 10https://gerrit.wikimedia.org/r/429449 (owner: 10Chad) [22:37:40] (03PS1) 10Mooeypoo: Enable RCFilters by default on Watchlist in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440629 (https://phabricator.wikimedia.org/T181193) [22:40:19] (03CR) 10Catrope: [C: 032] Enable RCFilters by default on Watchlist in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440629 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [22:41:54] (03Merged) 10jenkins-bot: Enable RCFilters by default on Watchlist in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440629 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [22:42:11] (03CR) 10jenkins-bot: Enable RCFilters by default on Watchlist in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440629 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [23:06:08] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: Add HTTPS 
support to wdqs-internal service - https://phabricator.wikimedia.org/T193473#4294361 (10Smalyshev) [23:18:02] (03PS1) 10Mooeypoo: Rollout Watchlist Structured Filters to most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440641 (https://phabricator.wikimedia.org/T181193) [23:21:28] (03PS1) 10Mooeypoo: Rollout Watchlist Structured Filters to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) [23:22:09] (03CR) 10Catrope: [C: 04-2] "Not before June 25th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440641 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [23:22:16] (03Abandoned) 10Aaron Schulz: [DNM] Set "mcrouterAware" flag for "memcached-mcrouter" object cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440039 (owner: 10Aaron Schulz) [23:22:31] (03CR) 10Catrope: [C: 04-2] "Not before June 25th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [23:25:51] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Ban clients of WDQS which don't follow throttling directives for some time - https://phabricator.wikimedia.org/T194653#4294384 (10Smalyshev) p:05Triage>03Normal [23:26:06] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Ban clients of WDQS which don't follow throttling directives for some time - https://phabricator.wikimedia.org/T194653#4204417 (10Smalyshev) 05Open>03Resolved [23:47:03] (03PS1) 10Aaron Schulz: Make mc-labs.php settings more similar to mc.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440643 [23:50:50] (03CR) 10Krinkle: [C: 032] "Confirmed that aside from the one message being changed, there are 0 hits for level-DEBUG in production channel 'preferences' over the pas" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440553 (owner: 10Krinkle) [23:55:30] (03PS2) 10Krinkle: logging: Raise minimum 
level for 'preferences' to INFO [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440553 [23:56:12] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200) [23:58:21] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy [23:58:36] elukey: Should the old puppet/varnishkafka repo be emptied/archived? [23:59:46] (03PS1) 10Krinkle: Archive repository [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/440644