[00:00:05] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T0000). Please do the needful. [00:00:05] awight: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:02:29] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [00:03:27] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [00:05:15] bstorm_: ^ Are these dbproxy alerts related? I don't know which proxy points to which cluster. [00:05:49] bd808: m5 is db1133 [00:06:10] I don't know if it uses dbproxy setups at all. I can check those [00:06:58] For those watching, the labweb* alerts above are database related (m5 cluster). [00:07:20] "We" are looking into it [00:07:39] * bd808 throws the air quotes for his helpfulness in the process [00:07:50] dbproxy1021 is the proxy for m5 [00:08:16] can I put a python dict/JSON object as a hiera value, and if so, does anyone have an example I can look at/copy from? [00:08:27] I'm betting they both proxy m5 [00:08:35] bstorm_: seems likely [00:09:24] It is. Just verified [00:09:40] legoktm: hiera is yaml, so yes. You can have a key that points to a dict. I think I know where an example of that is. One sec. 
[00:12:09] !log set max_connections to 600 temporarily while troubleshooting on m5 (db1133) [00:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:12:47] legoktm: here is a dict in hiera that Striker uses -- https://github.com/wikimedia/puppet/blob/production/hieradata/role/eqiad/wmcs/openstack/eqiad1/labweb.yaml#L14-L47 [00:13:20] thank you :) [00:14:11] The module that consumes that is not using the fancy new profile setup, so that is the $config param to the striker::uwsgi module at runtime [00:19:34] legoktm: you can put that in ./hieradata/labs/codesearch/common.yaml as something like profile::codesearch::foobar [00:20:13] (03CR) 10Arlolra: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564805 (https://phabricator.wikimedia.org/T239806) (owner: 10Arlolra) [00:21:56] (03PS1) 10Legoktm: codesearch: Alias production branch to master in /srv/puppet [puppet] - 10https://gerrit.wikimedia.org/r/564809 (https://phabricator.wikimedia.org/T242319) [00:22:30] !log restarted maintain-dbusers on labstore1004 after recovering the m5 DB's connection issue [00:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:32] mutante: thanks, I'll give that a shot. I was looking at putting https://gerrit.wikimedia.org/r/plugins/gitiles/labs/codesearch/+/master/ports.py into hiera and then generating the systemd units from that [00:23:57] (03CR) 10Legoktm: "I tested this by doing a fresh clone of operations/puppet, inspecting the relevant .git directory to ensure the file didn't exist, running" [puppet] - 10https://gerrit.wikimedia.org/r/564809 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [00:25:24] legoktm: hmm.. not sure if you want to go so far as to have puppet generate that. 
it would just be on the first run and you could just as well create them once and add them as normal files/templates expected by systemd::service [00:25:56] looks at the other change [00:27:07] mutante: I was thinking that the advantage of doing it via hiera was that I could add new ones without needing to get a puppet patch merged since I can edit the hiera from horizon [00:31:47] legoktm: the change with the git exec looks like it will work. though tbh i'd prefer using command => for the actual command and use some descriptive language as the resource title [00:32:01] ok, I can do that [00:33:30] legoktm: yea, i see that advantage though if that doesn't happen a lot i'd still put everything in the repo. editing Horizon also creates auto-commits / phabricator notifications and if you want to debug you have to look in multiple places for Hiera. [00:33:44] both will work though [00:34:02] (03PS2) 10Legoktm: codesearch: Alias production branch to master in /srv/puppet [puppet] - 10https://gerrit.wikimedia.org/r/564809 (https://phabricator.wikimedia.org/T242319) [00:34:54] (03CR) 10Dzahn: [C: 03+2] codesearch: Alias production branch to master in /srv/puppet [puppet] - 10https://gerrit.wikimedia.org/r/564809 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [00:37:02] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Bstorm) [00:37:05] it's pretty rare now I suppose [00:37:13] legoktm: i think i also misunderstood you a bit. ... yes, put the port numbers in Hiera and use the values in an .erb template for the systemd unit. if it doesn't change a lot i'd keep them in hieradata/labs/ though [00:37:34] ok, sounds good :) [00:37:52] Error: Failed to apply catalog: Validation of Exec[puppet alias origin/master] failed: 'git symbolic-ref refs/remotes/origin/master refs/remotes/origin/production' is not qualified and no path was specified. 
Please qualify the command or specify a path. (file: /etc/puppet/modules/codesearch/manifests/init.pp, line: 64) [00:38:09] Maybe I need /usr/bin/git ? [00:38:18] oh.. actually i forgot to say that [00:38:27] i had "maybe use full path" on the tip of my tongue [00:38:51] you can specify path => [00:39:04] like path => ['/usr/bin', '/usr/sbin',], [00:39:07] (03PS1) 10Legoktm: codesearch: Use full path to git in exec commands [puppet] - 10https://gerrit.wikimedia.org/r/564810 [00:39:32] https://codesearch.wmflabs.org/operations/?q=command&i=nope&files=%5C.pp&repos= looks like most things just use the full path to the command [00:40:11] yea. nice how you can use ..codesearch.. for that :) [00:40:20] (03CR) 10Dzahn: [C: 03+2] codesearch: Use full path to git in exec commands [puppet] - 10https://gerrit.wikimedia.org/r/564810 (owner: 10Legoktm) [00:40:27] :P [00:40:56] my codesearch is honestly still grep -r most of the time [00:41:35] alright, try now. [00:43:26] Notice: /Stage[main]/Codesearch/Exec[puppet alias origin/master]/returns: executed successfully [00:43:26] Notice: /Stage[main]/Codesearch/Exec[puppet alias master]/returns: executed successfully [00:43:38] and `git show-ref` looks correct [00:43:47] ty :)) [00:43:55] cool! 
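Note on the Hiera/systemd idea discussed above (keep the port numbers as a dict, render one systemd unit per entry from a template): a minimal Python sketch of that pattern, not the actual codesearch Puppet code. The `PORTS` mapping, its entries, the `render_units` helper, and the `ExecStart` placeholder are all hypothetical; only the `hound-<name>` unit naming comes from the patch titles in this log.

```python
from string import Template

# Hypothetical name -> port mapping, standing in for a dict stored in Hiera.
PORTS = {"search": 3002, "extensions": 3003}

# Illustrative unit body; ExecStart is a placeholder, not the real command.
UNIT_TEMPLATE = Template("""\
[Unit]
Description=hound-$name (illustrative)

[Service]
ExecStart=/bin/echo hound-$name would listen on port $port
Restart=on-failure

[Install]
WantedBy=multi-user.target
""")


def render_units(ports):
    """Return {unit filename: unit text}, one entry per name in `ports`."""
    units = {}
    for name, port in sorted(ports.items()):
        # Reject values that cannot be TCP ports before rendering anything.
        if not 0 < port < 65536:
            raise ValueError(f"invalid port for {name}: {port}")
        units[f"hound-{name}.service"] = UNIT_TEMPLATE.substitute(
            name=name, port=port
        )
    return units


if __name__ == "__main__":
    for filename, text in render_units(PORTS).items():
        print(f"# {filename}")
        print(text)
```

With this shape, adding a backend means adding one key to the mapping, which is the advantage raised in the discussion; the trade-off mentioned there (data living outside the repo) applies equally.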
[00:43:58] yw [00:44:00] I'll work on a patch for the other ports/systemd stuff later tonight [00:44:26] ack, ttyl then [00:47:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:54:07] from looking at the grafana link..icinga should recover any moment [01:03:07] 10Operations, 10ORES, 10Scoring-platform-team: ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (10Dzahn) [01:12:55] 10Operations: Add Daimona to #mediawiki_security - https://phabricator.wikimedia.org/T239093 (10RobH) 05Open→03Resolved a:05RobH→03None so anyone with op in #mediawiki_security can do this, not just me. **These do NOT need to come to me every time, this should be handled by clinic duty in the future.*... [01:13:44] !log dbproxy1017 - systemctl reload haproxy [01:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:17] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [01:14:22] bd808: fixed that ^ [01:15:05] (03PS1) 10Papaul: DHCP: Change logstash202[6-9] MAC address from 1G NIC MAC to 10GB MAC [puppet] - 10https://gerrit.wikimedia.org/r/564814 (https://phabricator.wikimedia.org/T240882) [01:16:57] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [01:17:11] !log dbproxy1017 and dbproxy1021 were showing "haproxy failover" icinga alerts. did the check described on https://wikitech.wikimedia.org/wiki/HAProxy#Failover and it claimed on both that db1133 was DOWN..but checking db1133 itself showed it was up and working normal. in that case the docs said to 'systemctl reload haproxy'. 
done on both and things recovered [01:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:56] (03PS1) 10Bstorm: neutron: constrain the maximum SQLAlchemy overflow value [puppet] - 10https://gerrit.wikimedia.org/r/564817 (https://phabricator.wikimedia.org/T242817) [01:17:58] (03CR) 10Dzahn: [C: 03+2] DHCP: Change logstash202[6-9] MAC address from 1G NIC MAC to 10GB MAC [puppet] - 10https://gerrit.wikimedia.org/r/564814 (https://phabricator.wikimedia.org/T240882) (owner: 10Papaul) [01:18:12] bstorm_: so i fixed the haproxy db alert thing that affected m5 [01:18:35] i guess the labweb* should also be fine then [01:19:29] 10Operations, 10ops-codfw: audit all codfw pdu tower draws - https://phabricator.wikimedia.org/T163362 (10RobH) 05Open→03Invalid a:05RobH→03None >>! In T163362#5105971, @Dzahn wrote: > duplicate of T163339 ? Yep! [01:19:35] papaul: you can go ahead with installer [01:19:35] The DB was actually "down" :) [01:19:41] T242817 [01:19:42] T242817: m5 ran out of connections after openstack upgrade to "Pike" - https://phabricator.wikimedia.org/T242817 [01:19:45] We broke it. [01:19:50] mutante: thanks [01:19:58] But thank you for kicking the proxy over as well :) [01:20:33] I hadn't bothered with it and have just been focused on the DB [01:21:00] 10Operations, 10DC-Ops: Cleanup/delete recycled and returned (lease tranche 1) hardware from Netbox - https://phabricator.wikimedia.org/T223467 (10RobH) 05Open→03Resolved a:05RobH→03None this is now being handled as part of T236972 [01:21:17] bstorm_: oh! i see. well the docs have the same case for "original server has recovered". 
alright [01:21:47] Legit [01:22:26] let me fix the path to the socket in that wiki page though, it's different on these prod servers today [01:23:30] papaul: yw [01:32:16] !log lvs1015 powercycling, crashed, nothing on console, lots of unknowns in icinga [01:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:34:13] RECOVERY - Host lvs1015 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [01:39:39] ^ ehh..yea.. it looks like that crashed 4 hours ago but it wasn't very noticeable because most of the service checks became UNKNOWN and not CRIT [01:39:43] and that means it's not on IRC [01:40:02] but nevertheless noticeable on Icinga web UI and looking a lot better now [01:40:15] i commented on -traffic [01:42:58] out for now [01:45:03] RECOVERY - snapshot of s4 in eqiad on db1115 is OK: snapshot for s4 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2020-01-14 23:29:21 from db1102.eqiad.wmnet:3314 (1077 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [02:15:54] (03CR) 10Andrew Bogott: [C: 03+1] "Looks good to me -- I think we could drop the api workers quite a bit more if we want." 
[puppet] - 10https://gerrit.wikimedia.org/r/564817 (https://phabricator.wikimedia.org/T242817) (owner: 10Bstorm) [02:22:19] (03PS1) 10Papaul: Partman: Add logstash202[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/564826 (https://phabricator.wikimedia.org/T240882) [02:24:36] (03CR) 10Papaul: [C: 03+2] Partman: Add logstash202[6-9] [puppet] - 10https://gerrit.wikimedia.org/r/564826 (https://phabricator.wikimedia.org/T240882) (owner: 10Papaul) [02:39:20] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) [03:15:14] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) [03:16:00] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi all yours [03:21:47] (03PS2) 10Krinkle: Reapply "ATS: unset Accept-Encoding" [puppet] - 10https://gerrit.wikimedia.org/r/564005 (https://phabricator.wikimedia.org/T242478) (owner: 10Ema) [03:39:00] (03CR) 10Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/564817 (https://phabricator.wikimedia.org/T242817) (owner: 10Bstorm) [03:47:09] (03PS1) 10Papaul: DHCP: Add Mac address for puppetmaster2003 [puppet] - 10https://gerrit.wikimedia.org/r/564836 (https://phabricator.wikimedia.org/T239732) [03:49:23] (03CR) 10Papaul: [C: 03+2] DHCP: Add Mac address for puppetmaster2003 [puppet] - 10https://gerrit.wikimedia.org/r/564836 (https://phabricator.wikimedia.org/T239732) (owner: 10Papaul) [04:04:21] (03CR) 10Andrew Bogott: [C: 03+1] "It should be fine -- it will probably cause a failover when it applies so best to apply to the inactive node first 
before it switches." [puppet] - 10https://gerrit.wikimedia.org/r/564817 (https://phabricator.wikimedia.org/T242817) (owner: 10Bstorm) [04:27:36] (03PS2) 10Andrew Bogott: neutron: constrain the maximum SQLAlchemy overflow value [puppet] - 10https://gerrit.wikimedia.org/r/564817 (https://phabricator.wikimedia.org/T242817) (owner: 10Bstorm) [04:28:45] (03CR) 10Andrew Bogott: [C: 03+2] neutron: constrain the maximum SQLAlchemy overflow value [puppet] - 10https://gerrit.wikimedia.org/r/564817 (https://phabricator.wikimedia.org/T242817) (owner: 10Bstorm) [04:43:21] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561918 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [04:46:50] (03CR) 10CRusnov: [C: 03+1] "looks good" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561603 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [04:52:28] (03CR) 10CRusnov: [C: 03+1] "I appreciate the necessity of this but it seems mildly uncomfortable to hard code frack and mgmt subdomains in the splitter function (alth" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561602 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [05:09:19] (03CR) 10CRusnov: [C: 03+1] "since these are produced from the asset tags in Netbox, I'm sure there is a level of accuracy, however, there are somewhat more mismatches" [dns] - 10https://gerrit.wikimedia.org/r/561925 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [05:10:35] (03CR) 10CRusnov: [C: 03+1] "+1 modulo previous discussion about getting the constants from the API instead of hardcoding them (iirc we agreed that this is fine since " [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561917 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [05:23:55] (03CR) 10CRusnov: [C: 03+1] "This is a more confident +1 because these are missing not different." 
[dns] - 10https://gerrit.wikimedia.org/r/561856 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [05:37:11] (03PS2) 10Jforrester: Enable ORES on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489936 (https://phabricator.wikimedia.org/T215354) (owner: 10Catrope) [05:38:16] (03CR) 10Jforrester: [C: 04-2] "Blocked on community consultation." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/489936 (https://phabricator.wikimedia.org/T215354) (owner: 10Catrope) [05:52:07] (03PS1) 10Andrew Bogott: make-instance-vg: check if lvm is present before we start [puppet] - 10https://gerrit.wikimedia.org/r/564847 (https://phabricator.wikimedia.org/T241868) [05:54:06] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@3c5f615]: Update mobileapps to 7f507ae [05:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:02] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@3c5f615]: Update mobileapps to 7f507ae (duration: 05m 56s) [06:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:18] (03PS1) 10Marostegui: install_server: Add entry for es2021 [puppet] - 10https://gerrit.wikimedia.org/r/564849 (https://phabricator.wikimedia.org/T241336) [06:03:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10148 and previous config saved to /var/cache/conftool/dbconfig/20200115-060347-marostegui.json [06:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1103:3312 - T239453', diff saved to https://phabricator.wikimedia.org/P10150 and previous config saved to /var/cache/conftool/dbconfig/20200115-061052-marostegui.json [06:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:57] T239453: Remove partitions from revision table - 
https://phabricator.wikimedia.org/T239453 [06:13:49] (03CR) 10Marostegui: [C: 03+2] install_server: Add entry for es2021 [puppet] - 10https://gerrit.wikimedia.org/r/564849 (https://phabricator.wikimedia.org/T241336) (owner: 10Marostegui) [06:14:42] 10Operations, 10ops-codfw, 10DBA: (Needed By 31st January) codfw: rack/setup/install es202[0-5].codfw.wmnet - https://phabricator.wikimedia.org/T241336 (10Marostegui) [06:16:51] (03PS1) 10Mholloway: MachineVision: Make testcommonswiki behavior consistent with commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564852 [06:16:57] !log Remove revision partitions from db2088:3311 - T239453 [06:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:00] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10151 and previous config saved to /var/cache/conftool/dbconfig/20200115-061859-marostegui.json [06:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:03] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [06:20:06] PROBLEM - MariaDB Slave Lag: s4 #page on db1081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 163868.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:20:11] (03CR) 10Mholloway: [C: 03+2] MachineVision: Make testcommonswiki behavior consistent with commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564852 (owner: 10Mholloway) [06:20:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3316 db1098:3317 for upgrade', diff saved to https://phabricator.wikimedia.org/P10152 and previous config saved to /var/cache/conftool/dbconfig/20200115-062028-marostegui.json 
[06:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:34] Checking db1081 [06:21:05] (03Merged) 10jenkins-bot: MachineVision: Make testcommonswiki behavior consistent with commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564852 (owner: 10Mholloway) [06:21:09] Ah, downtime expired I think [06:21:39] Yes, expired downtime [06:21:59] ah ha [06:22:48] Going to disable notifications for it [06:22:55] It will be a few more days like that [06:23:50] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Make testcommonswiki behavior consistent with commonswiki (duration: 01m 16s) [06:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:13] (03PS1) 10Marostegui: db1081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/564853 (https://phabricator.wikimedia.org/T232446) [06:25:11] !log Upgrade db1098:3316 and db1098:3317 [06:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:13] (03CR) 10Marostegui: [C: 03+2] db1081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/564853 (https://phabricator.wikimedia.org/T232446) (owner: 10Marostegui) [06:28:21] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [06:45:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317', diff saved to https://phabricator.wikimedia.org/P10155 and previous config saved to /var/cache/conftool/dbconfig/20200115-064535-marostegui.json [06:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10156 and previous config saved to /var/cache/conftool/dbconfig/20200115-064606-marostegui.json 
[06:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1080', diff saved to https://phabricator.wikimedia.org/P10157 and previous config saved to /var/cache/conftool/dbconfig/20200115-065305-marostegui.json [06:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317', diff saved to https://phabricator.wikimedia.org/P10158 and previous config saved to /var/cache/conftool/dbconfig/20200115-065353-marostegui.json [06:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1098:3316 and db1098:3317', diff saved to https://phabricator.wikimedia.org/P10159 and previous config saved to /var/cache/conftool/dbconfig/20200115-070201-marostegui.json [07:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:07] (03PS1) 10Legoktm: codesearch: Migrate ./write_config.py cron job to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/564857 (https://phabricator.wikimedia.org/T242319) [07:02:09] (03PS1) 10Legoktm: codesearch: Generate hound-${name} systemd units [puppet] - 10https://gerrit.wikimedia.org/r/564858 (https://phabricator.wikimedia.org/T242319) [07:02:51] (03CR) 10jerkins-bot: [V: 04-1] codesearch: Migrate ./write_config.py cron job to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/564857 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [07:03:08] (03CR) 10jerkins-bot: [V: 04-1] codesearch: Generate hound-${name} systemd units [puppet] - 10https://gerrit.wikimedia.org/r/564858 (https://phabricator.wikimedia.org/T242319) (owner: 10Legoktm) [07:04:10] (03PS2) 10Legoktm: codesearch: Migrate ./write_config.py cron job to systemd timer [puppet] - 
10https://gerrit.wikimedia.org/r/564857 (https://phabricator.wikimedia.org/T242319) [07:04:12] (03PS2) 10Legoktm: codesearch: Generate hound-${name} systemd units [puppet] - 10https://gerrit.wikimedia.org/r/564858 (https://phabricator.wikimedia.org/T242319) [07:08:07] (03PS3) 10Legoktm: codesearch: Generate hound-${name} systemd units [puppet] - 10https://gerrit.wikimedia.org/r/564858 (https://phabricator.wikimedia.org/T242319) [07:17:36] (03Abandoned) 10Elukey: admin: update user piccardi's ssh public key [puppet] - 10https://gerrit.wikimedia.org/r/564747 (https://phabricator.wikimedia.org/T151969) (owner: 10Elukey) [07:32:57] (03CR) 10Elukey: "Thanks for the code review, very nice idea. I'd need to test it and see what my team wants to do, but we really appreciate the help!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [07:35:17] (03CR) 10Elukey: "Before proceeding I'd need to verify this solution with my team, since we wanted to get rid of the /v2 part of the URL if possible. 
With t" [puppet] - 10https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: 10saper) [07:40:32] (03CR) 10Elukey: "> Directing at /v2/ as Saper's alternative patch might be better" [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) (owner: 10Elukey) [07:48:17] (03PS1) 10Ayounsi: Add ping3001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/564864 (https://phabricator.wikimedia.org/T190090) [07:48:39] (03CR) 10jerkins-bot: [V: 04-1] Add ping3001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/564864 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [07:50:03] (03PS2) 10Ayounsi: Add ping3001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/564864 (https://phabricator.wikimedia.org/T190090) [07:50:24] (03CR) 10jerkins-bot: [V: 04-1] Add ping3001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/564864 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [07:52:46] (03PS3) 10Ayounsi: Add ping3001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/564864 (https://phabricator.wikimedia.org/T190090) [07:53:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/564129 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [07:55:15] 10Operations: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10MoritzMuehlenhoff) p:05Triage→03Normal [07:56:16] 10Operations, 10Traffic: ATS strict round robin parent select policy doesn't work as expected - https://phabricator.wikimedia.org/T242778 (10MoritzMuehlenhoff) p:05Triage→03Normal [07:58:33] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: Move cassandra logging to logging pipeline - https://phabricator.wikimedia.org/T242585 (10MoritzMuehlenhoff) p:05Triage→03Normal [07:58:54] 10Operations, 10Wikimedia-Portals, 10Regression: www.wikipedia.org/robots.txt should not be a redirect - https://phabricator.wikimedia.org/T242500 (10MoritzMuehlenhoff) p:05Triage→03Normal [07:59:26] apergos: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/502527 I'll puppet-merge, run puppet on ores* then roll-restart uwsgi-ores and celery-worker-ores as per https://wikitech.wikimedia.org/wiki/ORES/Deployment#Puppet-managed_config_changes seems reasonable ? [07:59:35] nooope that was meant for akosiaris ^ [08:00:49] 10Operations: Add POP Ganeti clusters to makevm cookbook - https://phabricator.wikimedia.org/T242828 (10ayounsi) p:05Triage→03Low [08:00:52] 10Operations, 10Continuous-Integration-Infrastructure: Package python3.8 for stretch-wikimedia pyall component - https://phabricator.wikimedia.org/T241195 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff >>! In T241195#5793136, @faidon wrote: > I've updated the aforementioned apt repository with 3.8.1-2~buster1 p... [08:02:19] godog: :D [08:02:35] thanks, I 've been looking for that [08:03:43] sudo cumin -s 10 -b2 'ores2*' 'puppet agent -t ; systemctl restart celery-ores-worker' fwiw [08:03:57] and ores1* ofc [08:05:02] neat, thanks akosiaris ! 
trying now [08:05:24] more correctly, systemctl restart uwsgi-ores celery-ores-worker [08:05:46] (03CR) 10Filippo Giunchedi: [C: 03+2] ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [08:06:01] and probably vol.ans will chastise me for not doing it in a 'puppet agent -t' 'systemctl restart uwsgi-ores celery-ores-worker' way but old habits die hard [08:06:57] lolz [08:10:05] 10Operations: Add POP Ganeti clusters to makevm cookbook - https://phabricator.wikimedia.org/T242828 (10MoritzMuehlenhoff) This should be fixed when Valentín's https://gerrit.wikimedia.org/r/#/c/operations/software/spicerack/+/563132/ patch is merged. [08:13:55] !log testing ores logging to pipeline on ores2001 [08:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:53] (03PS1) 10Filippo Giunchedi: WIP: remove json_lines tcp [puppet] - 10https://gerrit.wikimedia.org/r/564866 [08:19:55] (03PS1) 10Filippo Giunchedi: hieradata: use logging pipeline for ores uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/564867 (https://phabricator.wikimedia.org/T213899) [08:21:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/20358/ores2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/564867 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [08:22:41] akosiaris: the ores part works as expected, the missing bit is uwsgi which is fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/564867 [08:23:03] !log roll restart ores in codfw/eqiad to apply logging pipeline changes [08:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:32] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: use logging pipeline for ores uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/564867 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [08:26:29] 
(03PS6) 10Elukey: aqs: replace logstash host/port with rsyslog localhost/port [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) [08:27:04] 10Operations: Add POP Ganeti clusters to makevm cookbook - https://phabricator.wikimedia.org/T242828 (10ayounsi) Slightly related, the `makevm` script on the Ganeti clusters only accepts a single character in the "row" question: ` Please enter the correct row. (A, B or C - gnt-group list to show) O How many vCP... [08:27:15] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (10MoritzMuehlenhoff) @jwang: The shell username "jwang" is already taken: Please create an account in "Cloud VPS" first following https... [08:27:24] godog: yesterday Dan deployed the new version of service runner so I think I can go ahead with AQS --^ [08:27:42] is there a quick/standard way that you use to check if the new logging pipeline is used? [08:28:04] (I am just trying to find a good way to make sure that the patch works when applied) [08:29:34] elukey: sweet, thanks! yeah you'll see in the message tags 'input-kafka-' when it has been received via kafka [08:29:44] (03CR) 10Ayounsi: [C: 03+2] Add ping3001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/564864 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [08:31:43] (03CR) 10Elukey: [C: 03+2] aqs: replace logstash host/port with rsyslog localhost/port [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [08:33:03] (03CR) 10Muehlenhoff: [C: 03+1] "Patch is fine, but waiting for Nuria." 
[puppet] - 10https://gerrit.wikimedia.org/r/562940 (https://phabricator.wikimedia.org/T241838) (owner: 10Dzahn) [08:33:23] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: use logging pipeline for ores uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/564867 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [08:33:32] (03PS2) 10Filippo Giunchedi: hieradata: use logging pipeline for ores uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/564867 (https://phabricator.wikimedia.org/T213899) [08:35:07] (03PS3) 10Ema: Reapply "ATS: unset Accept-Encoding" [puppet] - 10https://gerrit.wikimedia.org/r/564005 (https://phabricator.wikimedia.org/T242478) [08:40:40] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [08:40:40] !log elukey@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) [08:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:47] !log elukey@cumin1001 START - Cookbook sre.aqs.roll-restart [08:40:47] !log roll restart ores in codfw/eqiad to apply logging pipeline changes [08:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:37] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) [08:42:59] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) 05Stalled→03Open Dan deployed the new version of service-runner for aqs, I applied the puppet patch and verified that the ne... 
[08:43:02] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10elukey) [08:43:08] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:43:15] on it--^ [08:43:22] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [08:43:46] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:43:50] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [08:43:54] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:44:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [08:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:16] godog: aqs done! [08:44:29] elukey: amazing!! 
[08:44:58] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:10] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [08:45:29] for stat1007, there was a python process that tried to allocate a ton of ram, but it was promptly killed by the cgroups limits [08:45:36] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:45:38] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [08:45:42] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [08:46:01] the main issue is that those limits are per user, so if multiple ones are hammering the host we end up in spurious alarms [08:46:04] uff [08:46:11] will need to re-review those limits [08:47:29] random thought: maybe we can put all of those processes in a systemd slice and have limits on that instead, then it should apply to all child cgroups [08:48:10] I thought about that, but didn't find a reasonable way to do it yet.. 
I was able to apply limits to each user slice, but not to a group of users [08:48:15] (03PS1) 10Ayounsi: Add ping3001 to site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/564873 (https://phabricator.wikimedia.org/T190090) [08:48:15] it would definitely help [08:48:37] it is also true that now the mem limits for each slice are 90% of the total ram [08:48:48] so very high, I may need to go down to 70 [08:50:01] (03CR) 10Ema: [C: 03+2] prometheus: collect varnishd_mmap_count for varnish-frontend [puppet] - 10https://gerrit.wikimedia.org/r/564696 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [08:51:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1098:3316 and db1098:3317', diff saved to https://phabricator.wikimedia.org/P10160 and previous config saved to /var/cache/conftool/dbconfig/20200115-085145-marostegui.json [08:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:23] (03PS2) 10Ayounsi: Add ping3001 to site.pp, DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/564873 (https://phabricator.wikimedia.org/T190090) [08:52:39] (03PS1) 10Elukey: Apply tighter cgroup memory limits to Analytics client hosts [puppet] - 10https://gerrit.wikimedia.org/r/564874 [08:52:49] godog: --^ [08:54:13] (03CR) 10Ayounsi: [C: 03+2] Add ping3001 to site.pp, DHCP and netboot [puppet] - 10https://gerrit.wikimedia.org/r/564873 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [08:55:06] (03CR) 10Elukey: [C: 03+2] Apply tighter cgroup memory limits to Analytics client hosts [puppet] - 10https://gerrit.wikimedia.org/r/564874 (owner: 10Elukey) [08:56:13] (03CR) 10Muehlenhoff: [C: 03+1] "Seems fine" [puppet] - 10https://gerrit.wikimedia.org/r/564726 (https://phabricator.wikimedia.org/T242411) (owner: 10Ema) [09:04:06] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily 
edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:04:50] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:26] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:30] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:46] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:07:02] (03PS1) 10Ema: prometheus: fix varnishd_mmap_count pid extraction [puppet] - 10https://gerrit.wikimedia.org/r/564898 (https://phabricator.wikimedia.org/T242417) [09:07:20] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:08:38] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL 
CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:08:42] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:10:03] elukey: ^ [09:11:14] !log Deploy schema change on x1 codfw - T242749 [09:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:18] T242749: Drop `echo_notification_user_hash_timestamp` index where it exists - https://phabricator.wikimedia.org/T242749 [09:11:42] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:11:58] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1006.eqiad.wmnet, druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:14:36] (03PS1) 10Muehlenhoff: Switch url-downloader.codfw to urldownloader2001 [dns] - 10https://gerrit.wikimedia.org/r/564899 (https://phabricator.wikimedia.org/T224551) [09:14:54] here I am [09:15:16] (03PS11) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [09:16:53] ok I know what's happening, druid is kinda locked up after segments deletion [09:19:01] (03CR) 10Ema: [C: 03+2] cache: add CAP_KILL to varnish-frontend capabilities [puppet] - 10https://gerrit.wikimedia.org/r/564726 (https://phabricator.wikimedia.org/T242411) (owner: 10Ema) [09:19:19] !log roll-restart druid brokers on druid100[4-6] - locked up after segments deletion [09:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:34] I grabbed a 
jstack to analyze later on [09:19:42] it has been a while since it didn't happen [09:19:43] sigh [09:19:50] should recover now [09:19:58] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:20:02] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:20:02] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:20:18] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:20:28] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:20:51] (03CR) 10Ayounsi: [C: 03+1] "+1 on the principle that it's already not working so can't break it more :)" [puppet] - 10https://gerrit.wikimedia.org/r/564046 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [09:21:14] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:21:26] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:21:28] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:23:20] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [09:23:30] (03PS1) 10Ayounsi: Ping offload, add esams text-lb VIP [puppet] - 10https://gerrit.wikimedia.org/r/564908 (https://phabricator.wikimedia.org/T190090) [09:23:36] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal 
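Side note on the stat1007 cgroup discussion from 08:46 to 08:48 above: the problem with per-user limits is plain arithmetic, and the sketch below only illustrates it. The 90% and 70% figures come from the conversation; the function names are hypothetical and nothing here reflects the actual contents of the puppet patch.

```python
def worst_case_demand_percent(per_user_limit_percent: int, active_users: int) -> int:
    """Worst-case memory demand, as a percentage of host RAM, when every
    active user fills their own slice up to its limit at the same time."""
    if not 0 < per_user_limit_percent <= 100:
        raise ValueError("per-user limit must be in the range 1..100")
    if active_users < 0:
        raise ValueError("active_users cannot be negative")
    return per_user_limit_percent * active_users


def is_oversubscribed(per_user_limit_percent: int, active_users: int) -> bool:
    """True when the per-user limits alone cannot protect the host."""
    return worst_case_demand_percent(per_user_limit_percent, active_users) > 100


# Figures mentioned in the channel: 90% today, 70% as the tighter candidate.
for limit in (90, 70):
    for users in (1, 2, 3):
        demand = worst_case_demand_percent(limit, users)
        print(f"limit={limit}% users={users} worst case={demand}% "
              f"oversubscribed={is_oversubscribed(limit, users)}")
```

Under these assumptions, lowering the per-user limit from 90% to 70% still leaves the host oversubscribed as soon as two users are busy, which is why a single limit on a parent slice covering all user slices was suggested.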
[09:25:04] (03CR) 10Ayounsi: [C: 03+2] Ping offload, add esams text-lb VIP [puppet] - 10https://gerrit.wikimedia.org/r/564908 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [09:26:42] (03PS12) 10Giuseppe Lavagetto: lvs::monitor: fix most technical debt [puppet] - 10https://gerrit.wikimedia.org/r/564690 [09:27:38] 10Operations, 10Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (10SandraF_WMF) @jcrespo and @herron - just checking in, do you think this will work? Let me know if I can do anything to help. [09:32:21] !log Deploy schema change on x1 eqiad hosts T242749 [09:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:24] T242749: Drop `echo_notification_user_hash_timestamp` index where it exists - https://phabricator.wikimedia.org/T242749 [09:34:25] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (10MoritzMuehlenhoff) p:05Triage→03Normal [09:36:14] (03PS1) 10Ayounsi: Ping offload, add conf if ping_offload_redirect is set [homer/public] - 10https://gerrit.wikimedia.org/r/564914 [09:37:39] (03PS1) 10Ayounsi: Enable ping offload in esams [homer/public] - 10https://gerrit.wikimedia.org/r/564917 (https://phabricator.wikimedia.org/T190090) [09:40:53] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Ping offload, add conf if ping_offload_redirect is set [homer/public] - 10https://gerrit.wikimedia.org/r/564914 (owner: 10Ayounsi) [09:41:13] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Enable ping offload in esams [homer/public] - 10https://gerrit.wikimedia.org/r/564917 (https://phabricator.wikimedia.org/T190090) (owner: 10Ayounsi) [09:42:13] !log enable ping offload in esams - T190090 [09:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:16] T190090: Offload pings to dedicated 
server - https://phabricator.wikimedia.org/T190090 [09:43:55] (03PS1) 10Filippo Giunchedi: install_server: fix number of devices for raid10 recipes [puppet] - 10https://gerrit.wikimedia.org/r/564919 (https://phabricator.wikimedia.org/T156955) [09:46:55] !log depooling cp5012 for some ats parent select tests [09:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:49] (03PS2) 10Ema: prometheus: fix varnishd_mmap_count pid extraction [puppet] - 10https://gerrit.wikimedia.org/r/564898 (https://phabricator.wikimedia.org/T242417) [09:53:11] (03CR) 10Ema: [C: 03+2] prometheus: fix varnishd_mmap_count pid extraction [puppet] - 10https://gerrit.wikimedia.org/r/564898 (https://phabricator.wikimedia.org/T242417) (owner: 10Ema) [09:55:59] !log repooling cp5012 [09:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:18] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 (10ayounsi) 05Open→03Resolved Done, dashboard and doc updated. https://grafana.wikimedia.org/d/000000513/ping-offload [10:01:43] jouncebot: next [10:01:43] In 1 hour(s) and 58 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1200) [10:07:24] 10Operations, 10Traffic: ATS strict round robin parent select policy doesn't work as expected - https://phabricator.wikimedia.org/T242778 (10Vgutierrez) Apparently, when we disabled DNS resolution for parent requests to fix T232209 we introduced part of the issue. On short-lived ATS instances, enabling `proxy.... 
[10:08:18] !log cache: rolling varnish-frontend-restart to add CAP_KILL to varnish-frontend.service T242411 [10:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:22] T242411: varnish parent unable to send signals to child - https://phabricator.wikimedia.org/T242411 [10:08:37] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 56007856 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:10:13] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:15:22] (03PS1) 10Vgutierrez: ATS: Re-enable DNS resolution for ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/564929 (https://phabricator.wikimedia.org/T242778) [10:17:51] (03CR) 10Ema: [C: 03+1] "FFS" [puppet] - 10https://gerrit.wikimedia.org/r/564929 (https://phabricator.wikimedia.org/T242778) (owner: 10Vgutierrez) [10:18:00] ahahaha [10:19:33] (03CR) 10DCausse: [C: 04-1] "lgtm, minor comment on the commit message" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) (owner: 10Mstyles) [10:20:09] (03CR) 10Vgutierrez: [C: 03+2] ATS: Re-enable DNS resolution for ats-tls [puppet] - 10https://gerrit.wikimedia.org/r/564929 (https://phabricator.wikimedia.org/T242778) (owner: 10Vgutierrez) [10:23:20] 10Operations, 10Traffic: ATS strict round robin parent select policy doesn't work as expected - https://phabricator.wikimedia.org/T242778 (10Vgutierrez) before enabling DNS resolution on cp3052: ` vgutierrez@cp3052:~$ for port in {3120..3127}; do ss "( dport = $port or sport = $port )" |wc -l; done 3 144 4 13... 
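Side note on the T242778 comment quoted above: the shell loop counts sockets per parent port with ss and wc -l, and the quoted output is truncated. A minimal sketch of summarizing such counts is shown below. The port numbers and counts in the example are hypothetical placeholders, not the real cp3052 values, and the helper name is an assumption.

```python
from typing import Dict, Tuple


def busiest_share(counts: Dict[int, int]) -> Tuple[int, float]:
    """Return the port with the most connections and its share of the total.

    With an even round robin over n ports, each share would be close to 1/n.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no connections counted")
    port = max(counts, key=counts.get)
    return port, counts[port] / total


# Hypothetical per-port counts, for illustration only.
example_counts = {3120: 10, 3121: 70, 3122: 10, 3123: 10}
port, share = busiest_share(example_counts)
print(f"busiest port {port} holds {share:.0%} of connections; "
      f"an even spread would be {1 / len(example_counts):.0%}")
```

Note that ss prints a header row by default, so each wc -l figure in the quoted loop is one higher than the number of sockets; subtract one per port before feeding real numbers into a helper like this.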
[10:31:07] RECOVERY - Varnish frontend child restarted on cp3050 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3050&var-datasource=esams+prometheus/ops [10:38:23] !log installing openssl1.0 updates on stretch (update to 1.0.2u) [10:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:09] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [10:47:29] this is me ---^ [10:48:29] 10Operations, 10ops-eqiad, 10vm-requests: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10akosiaris) >>! In T228924#5786002, @herron wrote: > The row_A ganeti group is running low on memory capacity (please see T239151#5707691) . Should we allocate a few o... 
[10:50:47] the cp3050 varnish-fe recovery above is due to ongoing rolling restarts [10:50:47] RECOVERY - Oozie Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie [10:58:48] (03PS2) 10Vgutierrez: Release 8.0.5-1wm12 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/564584 (https://phabricator.wikimedia.org/T242620) [10:59:01] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm12 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/564584 (https://phabricator.wikimedia.org/T242620) (owner: 10Vgutierrez) [10:59:16] (03PS3) 10Ema: ATS: Deploy acme-chief version of unified certificate on text [puppet] - 10https://gerrit.wikimedia.org/r/561883 (https://phabricator.wikimedia.org/T234803) [11:00:53] (03CR) 10Ema: [C: 03+2] ATS: Deploy acme-chief version of unified certificate on text [puppet] - 10https://gerrit.wikimedia.org/r/561883 (https://phabricator.wikimedia.org/T234803) (owner: 10Ema) [11:07:41] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22636888 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:09:31] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 529968 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:09:45] !log added SonarQubeBot to "Non-Interactive Users" group on Gerrit [11:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:15] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 68655752 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:12:05] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 163872 and 59 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:14:17] (03CR) 10Muehlenhoff: [C: 03+1] "Doh!" [puppet] - 10https://gerrit.wikimedia.org/r/564919 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [11:15:11] (03CR) 10Marostegui: [C: 03+1] "Any objection to get this merged so we can proceed with the ES hosts installation?" [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [11:17:45] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10MoritzMuehlenhoff) [11:21:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] Pass down MAC address of to installing system via BOOTIF [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [11:23:59] PROBLEM - eventlogging Varnishkafka log producer on cp4031 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:24:05] PROBLEM - statsv Varnishkafka log producer on cp4031 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:24:35] PROBLEM - Webrequests Varnishkafka log producer on cp4031 is CRITICAL: PROCS CRITICAL: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:28:17] ema: roll restart related? 
--^ [11:28:32] (03PS2) 10Filippo Giunchedi: install_server: fix number of devices for raid10 recipes [puppet] - 10https://gerrit.wikimedia.org/r/564919 (https://phabricator.wikimedia.org/T156955) [11:28:34] (03PS1) 10Filippo Giunchedi: install_server: introduce raid0 standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955) [11:29:24] (03CR) 10Filippo Giunchedi: [C: 03+2] install_server: fix number of devices for raid10 recipes [puppet] - 10https://gerrit.wikimedia.org/r/564919 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [11:33:53] (03CR) 10Filippo Giunchedi: "Points I'm not 100% sure about:" [puppet] - 10https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [11:34:10] (03PS3) 10Vgutierrez: Release 8.0.5-1wm12 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/564584 (https://phabricator.wikimedia.org/T242620) [11:34:21] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.5-1wm12 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/564584 (https://phabricator.wikimedia.org/T242620) (owner: 10Vgutierrez) [11:35:31] RECOVERY - Webrequests Varnishkafka log producer on cp4031 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:36:14] !log restart all varnishkafka daemons on cp4031 [11:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:34] errors were like [11:36:34] Jan 15 11:20:06 cp4031 varnishkafka[11294]: TERM: Received signal 15: terminating [11:36:37] Jan 15 11:20:06 cp4031 systemd[1]: Stopping VarnishKafka eventlogging... [11:36:40] Jan 15 11:20:06 cp4031 systemd[1]: Stopped VarnishKafka eventlogging. [11:36:43] Jan 15 11:20:57 cp4031 systemd[1]: Dependency failed for VarnishKafka eventlogging. 
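Side note on the journal lines pasted above: a minimal sketch of how such syslog-style lines could be scanned for systemd dependency failures. The regular expressions and the function name are assumptions made for illustration; the sample input is the four lines quoted in the channel.

```python
import re

# Syslog-style prefix: "Jan 15 11:20:57 host process[pid]: message"
JOURNAL_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<process>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)
DEPENDENCY_FAILED = re.compile(r"^Dependency failed for (?P<unit>.+?)\.?$")


def failed_dependencies(lines):
    """Return (timestamp, host, unit description) for each systemd
    'Dependency failed for ...' message found in the given lines."""
    results = []
    for line in lines:
        parsed = JOURNAL_LINE.match(line.strip())
        if parsed is None or parsed.group("process") != "systemd":
            continue
        failure = DEPENDENCY_FAILED.match(parsed.group("message"))
        if failure:
            results.append(
                (parsed.group("timestamp"), parsed.group("host"), failure.group("unit"))
            )
    return results


SAMPLE = [
    "Jan 15 11:20:06 cp4031 varnishkafka[11294]: TERM: Received signal 15: terminating",
    "Jan 15 11:20:06 cp4031 systemd[1]: Stopping VarnishKafka eventlogging...",
    "Jan 15 11:20:06 cp4031 systemd[1]: Stopped VarnishKafka eventlogging.",
    "Jan 15 11:20:57 cp4031 systemd[1]: Dependency failed for VarnishKafka eventlogging.",
]

for timestamp, host, unit in failed_dependencies(SAMPLE):
    print(f"{timestamp} {host}: dependency failed for {unit}")
```

Only the last sample line is reported, which matches the reading in the channel: the daemons were stopped for the varnish-frontend restart and then failed to come back because a dependency was not available.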
[11:36:45] RECOVERY - eventlogging Varnishkafka log producer on cp4031 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:36:51] RECOVERY - statsv Varnishkafka log producer on cp4031 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [11:44:35] (03CR) 10Filippo Giunchedi: [C: 03+1] Pass down MAC address of to installing system via BOOTIF [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [11:58:18] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10MoritzMuehlenhoff) This install error is caused by the fact that these servers have a dual port NIC with 1G and 10G interfaces, but only t... [11:58:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1112', diff saved to https://phabricator.wikimedia.org/P10161 and previous config saved to /var/cache/conftool/dbconfig/20200115-115826-marostegui.json [11:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:43] (03CR) 10Muehlenhoff: "There's a side angle with potential impact to Stretch: I made a more detailed writeup at https://phabricator.wikimedia.org/T242481#5804823" [puppet] - 10https://gerrit.wikimedia.org/r/564729 (https://phabricator.wikimedia.org/T242481) (owner: 10Muehlenhoff) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1200). [12:00:04] awight: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:22] I can deploy my patch :-) [12:01:13] go for it! [12:01:54] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: d-i fails to install on servers with BRCM 2P 1G BT + 2P 10G SFP NDC - https://phabricator.wikimedia.org/T242481 (10Marostegui) I like option 3 indeed. There is one more option which is a bit more painful, which is to live hack pxelinux and set the MAC... [12:03:10] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Reject RPKI invalids on both transit and peering links [homer/public] - 10https://gerrit.wikimedia.org/r/563824 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [12:06:11] !log reject RPKI invalids in ulsfo - T220669 [12:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:14] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [12:11:02] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22835520 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:12:42] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 181056 and 40 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:17:45] (03CR) 10Muehlenhoff: "+1 on refactoring towards a multi snippet layout, e.g. elastic* nodes only have two disks." 
[puppet] - 10https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [12:18:26] RECOVERY - Varnish frontend child restarted on cp3054 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3054&var-datasource=esams+prometheus/ops [12:19:10] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:22:49] !log upgrading ats on cp4026, cp4032, cp5006 and cp5012 - T242778 T242620 [12:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:55] T242620: ats-tls is having issues when varnish-fe goes away - https://phabricator.wikimedia.org/T242620 [12:22:56] T242778: ATS strict round robin parent select policy doesn't work as expected - https://phabricator.wikimedia.org/T242778 [12:24:53] !log awight@deploy1001 Synchronized php-1.35.0-wmf.14/extensions/Cite: SWAT: [[gerrit:564002|Don't fail with a LogicException during section preview (T242434)]] (duration: 01m 10s) [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:56] T242434: ReferenceStack.php: Unknown ref "BLUE" in group "" - https://phabricator.wikimedia.org/T242434 [12:27:12] !log EU SWAT done [12:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:08] (03CR) 10Vgutierrez: [C: 03+2] Pool ulsfo for ncredir service [dns] - 10https://gerrit.wikimedia.org/r/564627 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [12:35:13] (03PS2) 10Vgutierrez: 
Pool ulsfo for ncredir service [dns] - 10https://gerrit.wikimedia.org/r/564627 (https://phabricator.wikimedia.org/T242321) [12:38:52] !log Pooling ulsfo for ncredir service - T242321 [12:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:55] T242321: Provide non-canonical-redirect service from every datacenter - https://phabricator.wikimedia.org/T242321 [12:44:14] (03CR) 10Vgutierrez: [C: 03+2] Add ncredir500[12] DNS records [dns] - 10https://gerrit.wikimedia.org/r/564655 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [12:44:29] (03PS2) 10Vgutierrez: Add ncredir500[12] DNS records [dns] - 10https://gerrit.wikimedia.org/r/564655 (https://phabricator.wikimedia.org/T242321) [12:50:24] !log reject RPKI invalids in eqsin - T220669 [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:27] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [12:54:28] (03PS2) 10Muehlenhoff: Switch url-downloader.codfw to urldownloader2001 [dns] - 10https://gerrit.wikimedia.org/r/564899 (https://phabricator.wikimedia.org/T224551) [12:56:39] <_joe_> jouncebot: next [12:56:39] In 0 hour(s) and 3 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1300) [12:56:49] <_joe_> ok, a good time for a gerrit restart [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1300) [13:00:36] (03CR) 10Muehlenhoff: [C: 03+2] Switch url-downloader.codfw to urldownloader2001 [dns] - 10https://gerrit.wikimedia.org/r/564899 (https://phabricator.wikimedia.org/T224551) (owner: 10Muehlenhoff) [13:01:46] (03PS1) 10Giuseppe Lavagetto: This is a test, please disregard [puppet] - 10https://gerrit.wikimedia.org/r/564994 (https://phabricator.wikimedia.org/T97972) [13:02:41] <_joe_> !log restarting gerrit [13:02:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:34] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:05:59] (03PS2) 10Giuseppe Lavagetto: This is a test, please disregard. [puppet] - 10https://gerrit.wikimedia.org/r/564994 (https://phabricator.wikimedia.org/T97972) [13:09:06] (03Abandoned) 10Giuseppe Lavagetto: This is a test, please disregard. [puppet] - 10https://gerrit.wikimedia.org/r/564994 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [13:10:54] <_joe_> the appservers latency is due to some memcache pressure [13:10:58] <_joe_> cc effie elukey [13:11:20] should we take a further look? 
[13:11:41] <_joe_> no, we should hurry with the gutter pool I guess :P [13:11:55] (03PS3) 10Muehlenhoff: Switch conf/codfw and notebook* servers to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) [13:13:03] (03CR) 10Muehlenhoff: [C: 03+2] Switch conf/codfw and notebook* servers to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/564550 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [13:14:49] well, requests have increased a bit as well [13:14:54] so this is related as well [13:15:29] (03PS1) 10Ema: systemd: allow setting global accounting options [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) [13:16:33] jouncebot: next [13:16:33] In 0 hour(s) and 43 minute(s): Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1400) [13:17:26] (03CR) 10jerkins-bot: [V: 04-1] systemd: allow setting global accounting options [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [13:21:09] _joe_ I don't see any particularly bad trend in mcrouter's latency on appservers, is there anything else that you saw? 
[13:21:40] (trying to understand the issue) [13:23:24] <_joe_> elukey: the latency of mcrouter went from 1 ms to 3 ms ("worst average response time") [13:23:43] <_joe_> and correspondingly the req/s to mcrouter increased [13:24:05] <_joe_> and correspondingly so did the latency in the responses to requests from the users [13:24:31] <_joe_> it's interesting to note the trend is only on the appservers [13:24:35] <_joe_> not the api [13:25:24] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:28:37] (03CR) 10Giuseppe Lavagetto: "Another small correction." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: 10Alexandros Kosiaris) [13:28:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:29:38] _joe_ sure ok but something like that shouldn't cause a ton of latency added to the final php response time no?
It is noticeable but tolerable in theory [13:32:02] I 'll do the citoid/zotero dance for fixing them [13:32:13] thx [13:32:51] ah ok now I see the graph that you mentioned, in the RED dashboard [13:34:48] from memcache/mcrouter I don't see any big increase in traffic, but there is for appservers [13:34:52] mmm [13:35:16] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:36:06] but this is a ~200ms jump of php render latency --^ [13:37:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:38:02] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:39:09] (03PS1) 10MSantos: increase spacing between osm replication [puppet] - 10https://gerrit.wikimedia.org/r/565013 [13:39:15] let me know when things are back to normal so I can resume my rpki rollout [13:41:16] (03PS2) 10MSantos: increase spacing between osm replication [puppet] - 10https://gerrit.wikimedia.org/r/565013 [13:41:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:41:59] (03PS2) 10Ema: systemd: allow setting global accounting options [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) [13:43:07] (03PS1) 
10Alexandros Kosiaris: calico: Add new urldownloader [deployment-charts] - 10https://gerrit.wikimedia.org/r/565015 (https://phabricator.wikimedia.org/T224551) [13:43:34] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Ensure Zotero is working) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:44:54] 10Operations, 10ORES, 10Scoring-platform-team: ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (10Halfak) Looks like celery has shut down on all of the workers. I'm looking into it now. I think we might be too close to the memory ceiling and an OOM is what's killing them.... [13:45:58] (03PS2) 10Filippo Giunchedi: install_server: introduce raid0 standard partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955) [13:46:06] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/564959 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [13:47:20] (03CR) 10Alexandros Kosiaris: nodejs10: Add buster image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: 10Alexandros Kosiaris) [13:49:04] (03PS1) 10Alexandros Kosiaris: calico: Remove all urldownloader IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/565016 (https://phabricator.wikimedia.org/T224551) [13:50:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "we can amend later I guess!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: 10Alexandros Kosiaris) [13:50:48] (03PS2) 10Alexandros Kosiaris: calico: Add new urldownloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/565015 (https://phabricator.wikimedia.org/T224551) [13:50:50] (03PS2) 10Alexandros Kosiaris: calico: Remove all urldownloader IPs [deployment-charts] - 10https://gerrit.wikimedia.org/r/565016 (https://phabricator.wikimedia.org/T224551) [13:51:05] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10fgiunchedi) Thank you @Papaul and @Dzahn ! [13:51:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] calico: Add new urldownloaders [deployment-charts] - 10https://gerrit.wikimedia.org/r/565015 (https://phabricator.wikimedia.org/T224551) (owner: 10Alexandros Kosiaris) [13:51:32] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:52:36] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [13:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:59] effie, _joe_ I'd check if there is a traffic pattern causing this, it seems to me that memcached can't cause a 200/300ms increase in php latency like that [13:53:22] !log update calico policy on eqiad/codfw/staging. Add new urldownloaders. 
T224551 [13:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:35] T224551: Migrate URL downloaders to Buster - https://phabricator.wikimedia.org/T224551 [13:54:12] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [13:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:28] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:54:39] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [13:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:44] heh, that was fast [13:55:24] 10Operations, 10ORES, 10Scoring-platform-team: ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (10Halfak) Something really strange is going on. I cut our celery workers in half an we're still not able to actually start up celery because we get a MemoryError during the startu... [13:56:07] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/563185 (https://phabricator.wikimedia.org/T237911) (owner: 10Alexandros Kosiaris) [13:56:20] there are some appservers running a bit hot [13:56:32] from the cpu usage in https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All [13:56:32] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [13:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:44] (03PS1) 10Ema: cache: enable systemd resources accounting on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) [13:57:01] jouncebot: next [13:57:01] In 0 hour(s) and 2 minute(s): Mediawiki train - European+American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1400) [13:57:36] so some mw have less than 25% of CPU usage, some others 75% [13:57:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:57:53] that doesn't make a lot of sense [13:58:45] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:43] (03PS2) 10Ema: cache: enable systemd resources accounting on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) [14:00:03] ok yes some hosts with weight=20 are the ones with the low cpu usage [14:00:04] liw and brennen: How many deployers does it take to do Mediawiki train - European+American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1400). [14:00:11] effie, _joe_ can I tune some weights? 
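Note on the weight discussion above: under weighted load balancing, a host's expected share of requests is its weight divided by the sum of all pooled weights, which would explain why the weight=20 hosts show lower CPU usage. A minimal sketch of that arithmetic follows. The pool composition and the weight of 30 for the other hosts are hypothetical, and the weight of 20 for mw1323/mw1324 is assumed from the conversation rather than confirmed; this is not the actual balancer logic.

```python
def traffic_share(weights):
    """Return each host's expected fraction of requests under weighted balancing."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one host needs a positive weight")
    return {host: weight / total for host, weight in weights.items()}


# Hypothetical four-host pool. The weight of 20 for mw1323/mw1324 is assumed
# from the discussion; the weight of 30 for the other hosts is made up.
pool = {"mw1323": 20, "mw1324": 20, "mw1238": 30, "mw1240": 30}

before = traffic_share(pool)
# Proposed change from the log: move mw1323 and mw1324 to weight 30.
after = traffic_share({**pool, "mw1323": 30, "mw1324": 30})

for host in pool:
    print(f"{host}: {before[host]:.1%} -> {after[host]:.1%}")
```

Raising a weight shifts traffic toward that host and away from every other pooled host, so it only helps if the host has CPU headroom to absorb the extra requests.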
[14:01:03] elukey: we did that a while ago [14:01:14] we will replace quite some servers soon anyway [14:01:29] sure but the latency is really high now [14:01:46] mw132[3,4] for example could be moved to 30 no? [14:02:04] let me see how old they are [14:02:16] (03PS1) 10Lars Wirzenius: group1 wikis to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565020 [14:02:18] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565020 (owner: 10Lars Wirzenius) [14:02:42] oh those are the leased ones [14:03:45] mw1253-8 are also at ~25% of cpu usage [14:03:51] the others are at 50% [14:04:05] elukey: there are actually 4 servers that are increasing the latency [14:04:13] mw1238 [14:04:22] mw1240 mw1246 [14:04:25] 40 41 46 [14:04:28] yes correct [14:04:36] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565020 (owner: 10Lars Wirzenius) [14:04:41] and 1242, although that is now ok [14:04:42] (03PS3) 10Ema: cache: enable systemd resources accounting on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) [14:04:43] what is the bandaid usually? restart? [14:04:59] I actually would like to depool them and see how that works out [14:05:08] mw1238 is usually the top server I see [14:05:17] all of those are to be decommed [14:05:42] if you are up for the experiment, I can do it [14:05:51] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.15 [14:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] effie: sure, what do you want to do exactly?
[14:06:05] (just to follow) [14:06:10] I want to depool those 3 servers [14:06:59] !log liw@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.15 (duration: 01m 07s) [14:07:00] and we can lower their weight if that does not work out well [14:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:55] effie: if you think it is good please do it, I am currently more on the line that there are a lot of hosts doing half of the work [14:08:05] sure [14:08:16] other ones are around the 70/75% mark as well [14:08:28] !log depool mw1238, mw1240, mw1246 [14:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:52] liw: please LMK once the train is done, thanks! [14:09:29] godog, it's done unless something explodes in group0 or group1 (group1 promotion just finished) [14:09:57] liw: ack, thanks [14:11:12] (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: format log consumer stdout as cee+json [puppet] - 10https://gerrit.wikimedia.org/r/563430 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [14:12:27] effie: first datapoints are looking good, let's see [14:14:11] elukey: nah the slower servers are slowing us down again [14:14:37] better than before [14:14:55] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [14:15:12] yes definitely, p95 is around a sec now [14:16:22] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Core Platform Team Legacy (Watching / External), and 2 others: Move Graphoid to Kubernetes via the deployment pipeline - https://phabricator.wikimedia.org/T203091 (10akosiaris) 05Stalled→03Declined Per T211881 graphoid is being undepl... 
[14:17:25] elukey: I am off to lunch, I will play with the weights a bit afterwards [14:18:02] ack [14:18:04] !log reenable puppet on cp hosts, after https://gerrit.wikimedia.org/r/c/operations/puppet/+/563430 deployment [14:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:29] elukey: how is the situation? should I hold my (unrelated) changes? [14:21:59] 10Operations, 10ops-codfw, 10Wikimedia-Logstash, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet - https://phabricator.wikimedia.org/T240882 (10fgiunchedi) a:05fgiunchedi→03herron [14:22:10] XioNoX: we have some extra latency [14:23:09] liw: noting a couple of errors at a low volume for wmf.15 [14:24:03] XioNoX: I'd say no, nothing on fire yet [14:24:05] 10Operations, 10Core Platform Team, 10Graphoid, 10serviceops: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) [14:24:13] 10Operations, 10Core Platform Team, 10Graphoid, 10serviceops: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) p:05Triage→03Normal [14:24:23] I appreciate the yet :) [14:25:03] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:25:17] !log reject RPKI invalids in ams - T220669 [14:25:18] brennen, aye [14:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:21] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [14:26:50] liw: filing issues for a few things. [14:27:28] (ah, you beat me to it.) [14:28:24] brennen, better two issues filed than none [14:30:00] liw: train is done? 
I need to restart PHP-FPM on mw1261-mw1265 [14:30:19] moritzm, yes [14:30:29] thx [14:30:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:30:50] !log rolling restart of FPM on mw1261-mw1265 to pick up OpenSSL security update [14:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:06] (03PS3) 10Ema: systemd: allow setting global accounting options [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) [14:31:09] (03PS4) 10Ema: cache: enable systemd resources accounting on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) [14:31:24] (03PS1) 10CDanis: fnm: bump pps threshold up a bit [puppet] - 10https://gerrit.wikimedia.org/r/565025 [14:31:31] (03PS1) 10Vgutierrez: install_server,ncredir: Install ncredir500[12] [puppet] - 10https://gerrit.wikimedia.org/r/565026 (https://phabricator.wikimedia.org/T242321) [14:32:42] (03CR) 10Ayounsi: [C: 03+1] fnm: bump pps threshold up a bit [puppet] - 10https://gerrit.wikimedia.org/r/565025 (owner: 10CDanis) [14:35:11] (03PS4) 10Ema: systemd: allow setting global accounting options [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) [14:35:13] (03PS5) 10Ema: cache: enable systemd resources accounting on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) [14:43:07] (03PS1) 10Elukey: aptrepo: add the bigtop14 component to wikimedia-stretch [puppet] - 10https://gerrit.wikimedia.org/r/565027 [14:44:00] !log reject RPKI invalids in dfw 
- T220669 [14:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:03] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [14:44:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [14:45:05] (03CR) 10Muehlenhoff: [C: 03+1] "Let's give it a shot, curious to see the performance impact!" [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [14:46:48] (03CR) 10CDanis: [C: 03+2] fnm: bump pps threshold up a bit [puppet] - 10https://gerrit.wikimedia.org/r/565025 (owner: 10CDanis) [14:48:26] (03PS6) 10Ema: cache: enable systemd resources accounting on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) [14:48:31] (03PS1) 10Ema: cache: add systemd config to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/565032 (https://phabricator.wikimedia.org/T183146) [14:49:21] (03PS1) 10Muehlenhoff: Switch url-downloader.eqiad to urldownloader1001 [dns] - 10https://gerrit.wikimedia.org/r/565033 (https://phabricator.wikimedia.org/T224551) [14:50:36] (03PS4) 10Filippo Giunchedi: varnish: use syslog for varnishlog consumers [puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) [14:52:15] (03CR) 10Vgutierrez: [C: 03+1] systemd: allow setting global accounting options [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [14:52:30] (03CR) 10Vgutierrez: [C: 03+1] cache: add systemd config to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/565032 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [14:53:33] !log pool mw1238, mw1240, mw1246 [14:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:01] !log lower weights on slower servers mw1238-mw1252 
[14:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:29] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1238.eqiad.wmnet [14:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:30] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1239.eqiad.wmnet [14:54:32] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1240.eqiad.wmnet [14:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:32] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1241.eqiad.wmnet [14:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:34] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1242.eqiad.wmnet [14:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:35] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1243.eqiad.wmnet [14:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:36] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1244.eqiad.wmnet [14:54:37] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1245.eqiad.wmnet [14:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:38] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1246.eqiad.wmnet [14:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:39] !log jiji@cumin1001 conftool action : 
set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1247.eqiad.wmnet [14:54:40] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1248.eqiad.wmnet [14:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:41] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1249.eqiad.wmnet [14:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:42] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1250.eqiad.wmnet [14:54:44] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1251.eqiad.wmnet [14:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:45] !log jiji@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,cluster=appserver,service=nginx,name=mw1252.eqiad.wmnet [14:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:52] effie: FYI you can use regexs [14:54:57] in a single command [14:55:12] ;) [14:55:18] volans: :) [14:56:11] (03CR) 10Ema: [C: 03+2] systemd: allow setting global accounting options [puppet] - 10https://gerrit.wikimedia.org/r/565006 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [14:56:54] (03CR) 10Filippo Giunchedi: "This is ready to go, PTAL!" 
[puppet] - 10https://gerrit.wikimedia.org/r/563977 (https://phabricator.wikimedia.org/T227108) (owner: 10Filippo Giunchedi) [14:58:36] (03PS2) 10Filippo Giunchedi: prometheus: bump 'global' retention to 2.25 years [puppet] - 10https://gerrit.wikimedia.org/r/564679 [14:58:38] (03PS2) 10Filippo Giunchedi: prometheus: bump 'ops' retention to 4.5 months [puppet] - 10https://gerrit.wikimedia.org/r/564680 [15:00:14] (03CR) 10Filippo Giunchedi: prometheus: bump 'global' retention to 2.25 years (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/564679 (owner: 10Filippo Giunchedi) [15:00:37] (03CR) 10jerkins-bot: [V: 04-1] prometheus: bump 'global' retention to 2.25 years [puppet] - 10https://gerrit.wikimedia.org/r/564679 (owner: 10Filippo Giunchedi) [15:00:57] (03PS3) 10Filippo Giunchedi: prometheus: bump 'global' retention to 2.25 years [puppet] - 10https://gerrit.wikimedia.org/r/564679 [15:00:59] (03PS3) 10Filippo Giunchedi: prometheus: bump 'ops' retention to 4.5 months [puppet] - 10https://gerrit.wikimedia.org/r/564680 [15:01:36] (03PS1) 10Jgreen: switch nsca_frack_cfg.erb monitoring frdb2001->frdb2002 and alnitak->frban2001 [puppet] - 10https://gerrit.wikimedia.org/r/565035 [15:02:00] !log installing OpenSSL security updates on mw* [15:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:13] (03CR) 10Ema: [C: 03+2] cache: add systemd config to profile::cache::base [puppet] - 10https://gerrit.wikimedia.org/r/565032 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [15:03:28] (03CR) 10Jgreen: [C: 03+2] switch nsca_frack_cfg.erb monitoring frdb2001->frdb2002 and alnitak->frban2001 [puppet] - 10https://gerrit.wikimedia.org/r/565035 (owner: 10Jgreen) [15:05:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch url-downloader.eqiad to urldownloader1001 [dns] - 10https://gerrit.wikimedia.org/r/565033 (https://phabricator.wikimedia.org/T224551) (owner: 10Muehlenhoff) [15:05:40] 10Operations, 10ops-codfw, 10fundraising-tech-ops: 
rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [15:07:00] (03CR) 10Volans: [C: 03+2] dns: include all IP addresses with FQDN [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561601 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:07:03] (03CR) 10Volans: [C: 03+2] dns: generate correct zone name in all cases [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561602 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:07:11] (03CR) 10Volans: [C: 03+2] dns: sort records by the rightmost part [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561603 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:07:24] (03CR) 10Volans: [C: 03+2] dns: manage also devices in Inventory state [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561917 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:07:31] (03CR) 10Volans: [C: 03+2] dns: manage separately servers from other devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/561918 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [15:07:47] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) [15:08:11] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) [15:08:21] (03PS2) 10Volans: eqiad: add missing mgmt asset tag records [dns] - 10https://gerrit.wikimedia.org/r/561856 (https://phabricator.wikimedia.org/T239597) [15:09:17] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18137512 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:09:17] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 
(host:localhost) 480576872 and 27 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:11:07] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 120 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:11:07] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 81672 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:13:15] (03CR) 10Ema: [C: 03+2] cache: enable systemd resources accounting on cp4026 [puppet] - 10https://gerrit.wikimedia.org/r/565019 (https://phabricator.wikimedia.org/T183146) (owner: 10Ema) [15:14:23] (03PS3) 10Volans: Fix network devices management records [dns] - 10https://gerrit.wikimedia.org/r/561857 (https://phabricator.wikimedia.org/T239597) [15:14:25] I'm deploying two train blockers: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseQualityConstraints/+/565012 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/565034 [15:15:29] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:15:48] (03CR) 10Vgutierrez: [C: 03+2] install_server,ncredir: Install ncredir500[12] [puppet] - 10https://gerrit.wikimedia.org/r/565026 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [15:16:32] (03CR) 10Volans: [C: 03+2] eqiad: add missing mgmt asset tag records [dns] - 10https://gerrit.wikimedia.org/r/561856 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [15:16:44] (03CR) 10Volans: [C: 03+2] Fix network devices management records [dns] - 10https://gerrit.wikimedia.org/r/561857 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [15:17:04] (03CR) 10Ottomata: [C: 03+1] aptrepo: add the bigtop14 component to wikimedia-stretch [puppet] 
- 10https://gerrit.wikimedia.org/r/565027 (owner: 10Elukey) [15:19:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/565027 (owner: 10Elukey) [15:20:07] !log installing OpenSSL security updates on db* hosts [15:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:12] 10Operations, 10Quality-and-Test-Engineering-Team (QTE), 10Wikimedia-Mailing-lists, 10User-zeljkofilipin: Close QA mailing list - https://phabricator.wikimedia.org/T237383 (10zeljkofilipin) [15:21:17] 10Operations, 10ops-eqiad, 10serviceops: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) [15:22:05] (03PS1) 10Effie Mouzeli: varnish: Add CInetHttp/1.0 to bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/565038 [15:22:15] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [dns] - 10https://gerrit.wikimedia.org/r/561925 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [15:22:29] (03PS2) 10Ema: ATS: remove X-Analytics from responses sent to users [puppet] - 10https://gerrit.wikimedia.org/r/559711 (https://phabricator.wikimedia.org/T196558) [15:23:51] (03PS1) 10CDanis: puppet-merge: clean up output a bit [puppet] - 10https://gerrit.wikimedia.org/r/565039 [15:24:49] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump 'global' retention to 2.25 years [puppet] - 10https://gerrit.wikimedia.org/r/564679 (owner: 10Filippo Giunchedi) [15:26:04] (03CR) 10Ema: [C: 03+2] ATS: remove X-Analytics from responses sent to users [puppet] - 10https://gerrit.wikimedia.org/r/559711 (https://phabricator.wikimedia.org/T196558) (owner: 10Ema) [15:26:15] (03CR) 10Volans: [C: 03+1] "LGTM, I don't mind the colors but other might ;)" [puppet] - 10https://gerrit.wikimedia.org/r/565039 (owner: 10CDanis) [15:26:34] (03CR) 10CDanis: [C: 03+2] puppet-merge: clean up output a bit [puppet] - 10https://gerrit.wikimedia.org/r/565039 (owner: 10CDanis) [15:28:52] !log 
cp3064: ats-tls-restart to apply https://gerrit.wikimedia.org/r/559711 T196558 [15:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:55] T196558: Send X-Analytics information from Varnish to Hadoop with VCL_Log - https://phabricator.wikimedia.org/T196558 [15:29:27] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:30:02] (03CR) 10CDanis: [C: 03+1] varnish: Add CInetHttp/1.0 to bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/565038 (owner: 10Effie Mouzeli) [15:30:43] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10Jgreen) 05Open→03Resolved [15:32:54] (03CR) 10Effie Mouzeli: [C: 03+2] varnish: Add CInetHttp/1.0 to bot_blocked_nets [puppet] - 10https://gerrit.wikimedia.org/r/565038 (owner: 10Effie Mouzeli) [15:33:46] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9904 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [15:33:54] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T239733 (10Jgreen) 05Open→03Resolved [15:37:28] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9904 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [15:37:51] !log rolling restart of 
ats-tls instances - T196558 T242778 [15:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:55] T242778: ATS strict round robin parent select policy doesn't work as expected - https://phabricator.wikimedia.org/T242778 [15:37:55] T196558: Send X-Analytics information from Varnish to Hadoop with VCL_Log - https://phabricator.wikimedia.org/T196558 [15:40:06] (03PS1) 10Cmjohnson: Adding all dns entries for mc-gp100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/565041 (https://phabricator.wikimedia.org/T241795) [15:40:47] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.15/extensions/Wikibase/client/includes/Api/PageTerms.php: [[gerrit:565034|Fix invalid iteration over false in PageTerms (T242856)]] (duration: 01m 06s) [15:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:50] T242856: PHP Warning: Invalid argument supplied for foreach() - https://phabricator.wikimedia.org/T242856 [15:44:55] (03PS2) 10Elukey: aptrepo: add the bigtop14 component to wikimedia-stretch [puppet] - 10https://gerrit.wikimedia.org/r/565027 [15:46:06] (03PS1) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: modest refactor [puppet] - 10https://gerrit.wikimedia.org/r/565043 (https://phabricator.wikimedia.org/T238766) [15:46:08] (03PS1) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: retry if we encounter an exception [puppet] - 10https://gerrit.wikimedia.org/r/565044 (https://phabricator.wikimedia.org/T238766) [15:46:55] (03CR) 10jerkins-bot: [V: 04-1] wmcs-dns-floating-ip-updater.py: modest refactor [puppet] - 10https://gerrit.wikimedia.org/r/565043 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [15:47:15] (03CR) 10jerkins-bot: [V: 04-1] wmcs-dns-floating-ip-updater.py: retry if we encounter an exception [puppet] - 10https://gerrit.wikimedia.org/r/565044 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [15:48:12] PROBLEM - Prometheus prometheus2004/global -or a Prometheus 
it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9904 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [15:49:17] (03PS2) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: modest refactor [puppet] - 10https://gerrit.wikimedia.org/r/565043 (https://phabricator.wikimedia.org/T238766) [15:49:19] (03PS2) 10Andrew Bogott: wmcs-dns-floating-ip-updater.py: retry if we encounter an exception [puppet] - 10https://gerrit.wikimedia.org/r/565044 (https://phabricator.wikimedia.org/T238766) [15:51:13] the prometheus restarts are expected [15:54:59] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.15/extensions/WikibaseQualityConstraints/extension.json: [[gerrit:565012|Fix service injection for special page (T242846)]] (duration: 01m 08s) [15:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:02] T242846: Argument 4 passed to WikibaseQuality\ConstraintReport\Specials\SpecialConstraintReport::newFromGlobalState() must be an instance of WikibaseQuality\ConstraintReport\ConstraintCheck\DelegatingConstraintChecker, instance of WikibaseQuality\ConstraintReport\Api\CachingResultsSource given, called in /srv/mediawiki/php-1.35.0-wmf.15/vendor/wikimedia/object-factory/src/ObjectFactory.php on line 172 - https://phabricator.wikimedia [15:55:08] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9904 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [15:55:30] liw: I'm done, the train should be unblocked on us [15:56:03] Amir1, thank you [15:56:38] 10Operations, 10ops-eqiad, 
10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [15:56:53] 10Operations, 10Maps: OSM Replication failed at eqiad and codfw - https://phabricator.wikimedia.org/T237228 (10MSantos) 05Open→03Resolved a:03MSantos [15:57:26] (03PS1) 10Papaul: DNS: Add mgmt and producion DNS for restbase202[1-3] [dns] - 10https://gerrit.wikimedia.org/r/565048 [15:58:45] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (ASAP) rack/setup/install frban1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10Jgreen) [15:59:04] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:01:42] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:07:44] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:08:03] (03PS1) 10BBlack: GeoDNS: Define alternate esams depooling method [dns] - 10https://gerrit.wikimedia.org/r/565049 [16:11:50] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:15:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:16:46] (03PS2) 10Muehlenhoff: Switch Thumbor to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563432 [16:20:03] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/20369/" [puppet] - 10https://gerrit.wikimedia.org/r/563432 (owner: 10Muehlenhoff) [16:21:14] (03CR) 10Muehlenhoff: [C: 03+2] Switch Thumbor to apt::package_from_component [puppet] - 10https://gerrit.wikimedia.org/r/563432 (owner: 10Muehlenhoff) [16:21:27] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:22:33] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 2 others: LDF server has 404 errors for JS and CSS resources - https://phabricator.wikimedia.org/T237165 (10Gehel) 05Open→03Resolved [16:23:45] (03CR) 10Elukey: [C: 03+2] aptrepo: add the bigtop14 component to wikimedia-stretch [puppet] - 10https://gerrit.wikimedia.org/r/565027 (owner: 10Elukey) [16:27:09] !log import key 0xDBBF9D42B7B4BD70 (Apache BigTop) manually on install1002's gpg [16:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:35] (03CR) 10Filippo Giunchedi: [C: 03+1] DNS: Add mgmt and producion DNS for restbase202[1-3] [dns] - 10https://gerrit.wikimedia.org/r/565048 (owner: 10Papaul) [16:29:24] (03CR) 10Cwhite: [C: 03+2] mtail: track new subscription requests in prometheus [puppet] - 10https://gerrit.wikimedia.org/r/564129 (https://phabricator.wikimedia.org/T236505) (owner: 10Cwhite) [16:29:41] (03PS1) 10Papaul: DHCP: Add MAC address entires for restbase202[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/565057 (https://phabricator.wikimedia.org/T241790) [16:30:00] (03CR) 10Filippo Giunchedi: [C: 03+2] DNS: Add mgmt and producion DNS for restbase202[1-3] [dns] - 10https://gerrit.wikimedia.org/r/565048 (owner: 10Papaul) [16:30:28] 10Operations, 10hardware-requests: Expand Eqiad Ganeti row_A capacity - https://phabricator.wikimedia.org/T242885 (10herron) [16:31:38] (03CR) 10CDanis: [C: 03+1] "Looks good! One documentation nit." 
(031 comment) [dns] - 10https://gerrit.wikimedia.org/r/565049 (owner: 10BBlack) [16:32:18] 10Operations, 10ops-codfw, 10Core Platform Team, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10Papaul) [16:35:27] (03CR) 10Filippo Giunchedi: [C: 03+2] DHCP: Add MAC address entires for restbase202[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/565057 (https://phabricator.wikimedia.org/T241790) (owner: 10Papaul) [16:36:34] (03CR) 10CRusnov: [C: 03+1] "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/561925 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [16:36:39] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:37:06] (03PS2) 10Volans: mgmt: fix asset tags based on the physical label [dns] - 10https://gerrit.wikimedia.org/r/561925 (https://phabricator.wikimedia.org/T239597) [16:37:52] (03CR) 10Volans: [C: 03+2] mgmt: fix asset tags based on the physical label [dns] - 10https://gerrit.wikimedia.org/r/561925 (https://phabricator.wikimedia.org/T239597) (owner: 10Volans) [16:37:57] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 511 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:38:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Hardware asset tag Netbox/DNS mgmt inconsistencies - https://phabricator.wikimedia.org/T239597 (10Volans) 05Open→03Resolved [16:38:51] 10Operations, 10ORES, 10Scoring-platform-team: ores.wmflabs.org - 503 icinga alerts - 
https://phabricator.wikimedia.org/T242819 (10Halfak) a:03Halfak [16:39:16] 10Operations, 10ORES, 10Scoring-platform-team: Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (10Halfak) a:03Halfak [16:43:05] (03PS1) 10Papaul: Partman: Add restbase202[1-3] to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/565058 (https://phabricator.wikimedia.org/T241790) [16:43:41] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 511 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [16:43:42] (03CR) 10Volans: [C: 03+2] ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - 10https://gerrit.wikimedia.org/r/563132 (owner: 10Vgutierrez) [16:43:54] \o/ [16:44:04] vgutierrez: sorry for the delay :) [16:44:07] np [16:44:14] I need to tweak a bit the cookbook anyway [16:44:20] I've already installed on the ganeti instances I needed [16:44:24] sorry [16:44:36] but hopefully it would be useful for somebody else :) [16:48:00] (03Merged) 10jenkins-bot: ganeti: Add esams, ulsfo and eqsin clusters and rows [software/spicerack] - 10https://gerrit.wikimedia.org/r/563132 (owner: 10Vgutierrez) [16:52:39] (03CR) 10Filippo Giunchedi: [C: 03+2] Partman: Add restbase202[1-3] to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/565058 (https://phabricator.wikimedia.org/T241790) (owner: 10Papaul) [16:52:53] (03PS1) 10CDanis: ripeatlas alerts: link to the grafana dashboard too [puppet] - 10https://gerrit.wikimedia.org/r/565060 [16:53:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:53:34] (03CR) 10jerkins-bot: [V: 04-1] ripeatlas alerts: link to the grafana dashboard too [puppet] - 10https://gerrit.wikimedia.org/r/565060 (owner: 10CDanis) [16:53:45] 10Operations, 10ops-codfw, 10Core Platform Team, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-private1-b-codfw] member ge-5/0/1 { ...... [16:55:03] 10Operations, 10ops-codfw, 10Core Platform Team, 10Patch-For-Review: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10Papaul) [16:55:12] (03PS2) 10CDanis: ripeatlas alerts: link to the grafana dashboard too [puppet] - 10https://gerrit.wikimedia.org/r/565060 [16:56:03] (03PS1) 10Volans: ganeti: add support to PoPs DCs [puppet] - 10https://gerrit.wikimedia.org/r/565061 (https://phabricator.wikimedia.org/T242828) [16:57:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:57:33] (03CR) 10CDanis: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/20370/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/565060 (owner: 10CDanis) [17:00:24] (03PS1) 10Vgutierrez: Add ncredir-lb.eqsin.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/565062 (https://phabricator.wikimedia.org/T242321) [17:07:35] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 115082576 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:07:46] (03CR) 10Arturo Borrero Gonzalez: "personally I find it difficult to review big patches like this one." [puppet] - 10https://gerrit.wikimedia.org/r/565043 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [17:08:55] 10Operations: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (10Aklapper) Thanks, that was the missing link to find out about https://phabricator.wikimedia.org/p/ProdPasteBot/ [17:09:07] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 8792 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:09:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/565044 (https://phabricator.wikimedia.org/T238766) (owner: 10Andrew Bogott) [17:09:13] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 69089904 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:09:33] (03CR) 10ArielGlenn: [C: 03+1] "good to see codfw put to more use ;-)" [dns] - 10https://gerrit.wikimedia.org/r/565049 (owner: 10BBlack) [17:10:49] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 7664 and 78 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:19:50] 10Operations: ProdPasteBot uses deprecated certificate auth - https://phabricator.wikimedia.org/T242857 (10Reedy) >>! In T242857#5806075, @Aklapper wrote: > Thanks, that was the missing link to find out about https://phabricator.wikimedia.org/p/ProdPasteBot/ Yeah, sorry. I didn't realise that the bot wasn't act... 
[17:21:30] (03PS1) 10Ladsgroup: labs: Set migration stage for properties to new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) [17:22:24] (03CR) 10Addshore: [C: 03+1] labs: Set migration stage for properties to new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [17:22:43] (03CR) 10jerkins-bot: [V: 04-1] labs: Set migration stage for properties to new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [17:24:14] 10Operations, 10ops-codfw, 10Core Platform Team: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (10Papaul) [17:26:09] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for frlog2001 [dns] - 10https://gerrit.wikimedia.org/r/565067 [17:26:49] (03PS2) 10Ladsgroup: labs: Set migration stage for properties to new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) [17:28:36] (03CR) 10Vgutierrez: [C: 03+2] Add ncredir-lb.eqsin.wikimedia.org DNS records [dns] - 10https://gerrit.wikimedia.org/r/565062 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [17:29:15] (03CR) 10Ladsgroup: "nooop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [17:29:49] (03CR) 10Addshore: [C: 03+1] labs: Set migration stage for properties to new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [17:34:31] (03PS1) 10Vgutierrez: lvs: Set realserver_ips on ncredir eqsin instances [puppet] - 10https://gerrit.wikimedia.org/r/565070 (https://phabricator.wikimedia.org/T242321) [17:34:36] (03CR) 10Ladsgroup: [C: 03+2] labs: Set migration stage for properties to new [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [17:35:46] (03Merged) 10jenkins-bot: labs: Set migration stage for properties to new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565064 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [17:36:11] ^ rebased [17:36:11] (03CR) 10Dwisehaupt: [C: 03+1] "Looks good. shipit." [dns] - 10https://gerrit.wikimedia.org/r/565067 (owner: 10Papaul) [17:37:22] (03CR) 10Vgutierrez: [C: 03+2] lvs: Set realserver_ips on ncredir eqsin instances [puppet] - 10https://gerrit.wikimedia.org/r/565070 (https://phabricator.wikimedia.org/T242321) (owner: 10Vgutierrez) [17:38:52] (03PS1) 10Ladsgroup: Stop writing to wb_terms for properties in Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565073 (https://phabricator.wikimedia.org/T225054) [17:40:30] (03CR) 10Addshore: [C: 03+1] Stop writing to wb_terms for properties in Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565073 (https://phabricator.wikimedia.org/T225054) (owner: 10Ladsgroup) [17:42:47] (03PS1) 10Ladsgroup: Set read for items in Wikidata for new term store up to Q8M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565074 (https://phabricator.wikimedia.org/T225057) [17:50:34] !log anomie@deploy1001 Synchronized private/PrivateSettings.php: Setting RSA keys for OAuth 2.0 (T242872) (duration: 01m 05s) [17:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:23] (03PS1) 10BBlack: webserver-misc-static cert: add wikiworkshop.org [puppet] - 10https://gerrit.wikimedia.org/r/565078 (https://phabricator.wikimedia.org/T242374) [17:57:40] (03PS1) 10Anomie: Set OAuth 2 access token expiry to 'infinity' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565079 [17:58:28] (03PS1) 10BBlack: wikiworkshop.org: Add CAA for LE certs [dns] - 10https://gerrit.wikimedia.org/r/565080 (https://phabricator.wikimedia.org/T242374) [17:58:30] 
(03CR) 10BBlack: [C: 03+2] webserver-misc-static cert: add wikiworkshop.org [puppet] - 10https://gerrit.wikimedia.org/r/565078 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [18:00:03] (03CR) 10BBlack: [C: 03+2] wikiworkshop.org: Add CAA for LE certs [dns] - 10https://gerrit.wikimedia.org/r/565080 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [18:03:13] (03PS1) 10BBlack: wikiworkshop: define internal microsite setup [puppet] - 10https://gerrit.wikimedia.org/r/565081 (https://phabricator.wikimedia.org/T242374) [18:07:17] (03CR) 10Cicalese: [C: 03+1] Set OAuth 2 access token expiry to 'infinity' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565079 (owner: 10Anomie) [18:08:12] (03CR) 10Anomie: [C: 03+2] Set OAuth 2 access token expiry to 'infinity' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565079 (owner: 10Anomie) [18:08:23] (03CR) 10BBlack: [C: 03+2] wikiworkshop: define internal microsite setup [puppet] - 10https://gerrit.wikimedia.org/r/565081 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [18:09:05] (03Merged) 10jenkins-bot: Set OAuth 2 access token expiry to 'infinity' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565079 (owner: 10Anomie) [18:10:58] !log anomie@deploy1001 Synchronized wmf-config/CommonSettings.php: Set OAuth 2 access token expiry to "infinity" (duration: 01m 04s) [18:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:27] 10Operations, 10Phabricator, 10Traffic, 10serviceops, 10Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3)): Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10mmodell) p:05Normal→03High [18:22:12] (03PS1) 10BBlack: acmechief: define public wikiworkshop.org cert [puppet] - 10https://gerrit.wikimedia.org/r/565084 (https://phabricator.wikimedia.org/T242374) [18:22:57] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, 
10User-brennen: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10brennen) [18:24:39] 10Operations, 10Scap, 10serviceops: Make canary wait time configurable - https://phabricator.wikimedia.org/T217924 (10jijiki) [18:25:09] (03PS1) 10BBlack: wikiworkshop: set up cache routing [puppet] - 10https://gerrit.wikimedia.org/r/565085 (https://phabricator.wikimedia.org/T242374) [18:25:20] (03CR) 10BBlack: [C: 03+2] acmechief: define public wikiworkshop.org cert [puppet] - 10https://gerrit.wikimedia.org/r/565084 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [18:25:53] 10Operations, 10Scap, 10serviceops: Make canary wait time configurable - https://phabricator.wikimedia.org/T217924 (10jijiki) @thcipriani as per our discussion, we can consider merging and testing first for syncing files and then on the train. How does that sound? [18:27:20] (03CR) 10BBlack: [C: 03+2] wikiworkshop: set up cache routing [puppet] - 10https://gerrit.wikimedia.org/r/565085 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [18:30:21] centralization :/ [18:30:45] ouch, somebody unplugged the wrong thing [18:34:02] 10Operations, 10ops-codfw: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (10Papaul) The installation is failing with the message below raid1-2dev doesn't exist. It should be partman/raid1-dev.cfg I will update the netboot.cfg late... 
[18:40:21] (03PS1) 10BBlack: wikiworkshop: add to varnish allowed hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/565086 (https://phabricator.wikimedia.org/T242374) [18:40:23] (03PS1) 10Jforrester: .gitignore: Add Beta Cluster-specific items to clean up git there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565087 (https://phabricator.wikimedia.org/T238595) [18:41:34] (03CR) 10BBlack: [C: 03+2] wikiworkshop: add to varnish allowed hosts as well [puppet] - 10https://gerrit.wikimedia.org/r/565086 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [18:56:56] (03PS1) 10BBlack: wikiworkshop: fixup for www redirect [puppet] - 10https://gerrit.wikimedia.org/r/565090 (https://phabricator.wikimedia.org/T242374) [18:57:32] PROBLEM - traffic_server backend process restarted on cp3063 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=esams+prometheus/ops&var-instance=cp3063&var-layer=backend [18:58:14] (03CR) 10Jforrester: [C: 03+2] .gitignore: Add Beta Cluster-specific items to clean up git there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565087 (https://phabricator.wikimedia.org/T238595) (owner: 10Jforrester) [18:59:07] (03Merged) 10jenkins-bot: .gitignore: Add Beta Cluster-specific items to clean up git there [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565087 (https://phabricator.wikimedia.org/T238595) (owner: 10Jforrester) [19:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1900). [19:00:04] Tchanders: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[19:02:07] (03CR) 10BBlack: [C: 03+2] wikiworkshop: fixup for www redirect [puppet] - 10https://gerrit.wikimedia.org/r/565090 (https://phabricator.wikimedia.org/T242374) (owner: 10BBlack) [19:04:43] (03PS1) 10Volans: netbox: fix user passwords [puppet] - 10https://gerrit.wikimedia.org/r/565092 [19:07:40] (03PS2) 10Jforrester: Enable banner for wikis that recently opted in to partial blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [19:08:14] (03PS3) 10Jforrester: Deploy partial blocks on commons wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) (owner: 10DannyS712) [19:13:21] (03CR) 10Cmjohnson: [C: 03+2] Adding all dns entries for mc-gp100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/565041 (https://phabricator.wikimedia.org/T241795) (owner: 10Cmjohnson) [19:13:25] (03PS2) 10Cmjohnson: Adding all dns entries for mc-gp100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/565041 (https://phabricator.wikimedia.org/T241795) [19:13:32] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Adding all dns entries for mc-gp100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/565041 (https://phabricator.wikimedia.org/T241795) (owner: 10Cmjohnson) [19:13:37] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/565092 (owner: 10Volans) [19:13:58] (03CR) 10Volans: [C: 03+2] netbox: fix user passwords [puppet] - 10https://gerrit.wikimedia.org/r/565092 (owner: 10Volans) [19:14:55] 10Operations, 10ops-eqiad, 10serviceops, 10Patch-For-Review: (Need By: Jan 10) rack/setup/install mc-gp100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T241795 (10Cmjohnson) [19:25:39] (03PS1) 10Volans: netbox: fix user passwords (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/565095 [19:31:46] (03PS2) 10Volans: netbox: fix user passwords (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/565095 [19:32:55] 10Operations, 10ops-eqiad, 
10SRE-swift-storage: Degraded RAID on ms-be1039 - https://phabricator.wikimedia.org/T242511 (10Cmjohnson) AHS log sent to HPE as per their request [19:35:50] (03PS3) 10Volans: netbox: fix user passwords (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/565095 [19:37:13] (03CR) 10CRusnov: [C: 03+1] "As discussed and debugged, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/565095 (owner: 10Volans) [19:37:18] (03CR) 10Volans: "compiler results https://puppet-compiler.wmflabs.org/compiler1003/20373/netboxdb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/565095 (owner: 10Volans) [19:38:27] (03CR) 10Volans: [C: 03+2] netbox: fix user passwords (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/565095 (owner: 10Volans) [19:44:11] Is SWAT happening? [19:44:29] jouncebot: now [19:44:29] For the next 0 hour(s) and 15 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T1900) [19:44:47] * Krinkle checks if James_F|Busy is the real James :D [19:44:49] Hi [19:45:06] * Krinkle authenticates [19:46:22] mutante: would you be willing to do a gerrit admin job? [19:46:22] (03CR) 10Nuria: "Nice, thanks for doing this" [puppet] - 10https://gerrit.wikimedia.org/r/564874 (owner: 10Elukey) [19:46:48] this new gerrit privilege policy is a mess [19:46:53] and annoying [19:48:17] If nobody is operating SWAT, I'll take this moment to roll out a patch of mine. [19:48:23] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/MultimediaViewer/+/564811/ [19:48:25] hauskatze|dinner: on a ticket, sure [19:48:27] Krinkle: I was going to. [19:48:43] OK :) [19:48:44] Now that I can. 
[19:48:47] (03CR) 10Jforrester: [C: 03+2] Enable banner for wikis that recently opted in to partial blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [19:49:02] mutante: T241509 & thanks [19:49:05] T241509: +2 for Zoranzoki21 in mediawiki/extensions/GoogleAdSense - https://phabricator.wikimedia.org/T241509 [19:49:29] extension-GoogleAdSense is locked to Administrators so I can't add him [19:49:51] (03Merged) 10jenkins-bot: Enable banner for wikis that recently opted in to partial blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564121 (https://phabricator.wikimedia.org/T240300) (owner: 10Tchanders) [19:49:53] (03PS1) 10Papaul: Partman: Fix typo on puppetmaster* [puppet] - 10https://gerrit.wikimedia.org/r/565097 (https://phabricator.wikimedia.org/T239732) [19:50:49] James_F I did +2 mine, so beware with pulling. Can do later. [19:51:10] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for frlog2001 [dns] - 10https://gerrit.wikimedia.org/r/565067 (owner: 10Papaul) [19:51:22] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for frlog2001 [dns] - 10https://gerrit.wikimedia.org/r/565067 [19:51:24] Krinkle: I saw. ;-) [19:52:03] (03CR) 10Papaul: [C: 03+2] DNS: Add mgmt and production DNS for frlog2001 [dns] - 10https://gerrit.wikimedia.org/r/565067 (owner: 10Papaul) [19:52:25] James_F: Looks good [19:52:46] Cool, syncing. [19:53:08] James_F: Is there time to do https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/564097/ too? AHT gave the go-ahead. No worries if not [19:53:33] Tchanders: That was going to be my next step. [19:53:35] Krinkle: Only if it's Ok with you too [19:53:41] (03CR) 10Jforrester: [C: 03+2] Deploy partial blocks on commons wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) (owner: 10DannyS712) [19:54:19] Tchanders sure, I can wait. 
So long as it isn't a mediawiki/extensions* patch it won't complicate what James is doing [19:54:20] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable banner for wikis that recently opted in to partial blocks T240300 T242570 T242569 (duration: 01m 05s) [19:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:25] T242570: Deploy partial blocks on Wikimedia Commons - https://phabricator.wikimedia.org/T242570 [19:54:25] T240300: Introduce a temporary banner on Special:Block to inform users about upcoming partial blocks deploy - https://phabricator.wikimedia.org/T240300 [19:54:26] T242569: Deploy partial blocks on English wikipedia - https://phabricator.wikimedia.org/T242569 [19:54:38] (03Merged) 10jenkins-bot: Deploy partial blocks on commons wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/564097 (https://phabricator.wikimedia.org/T242570) (owner: 10DannyS712) [19:55:11] Tchanders: No need to test, right? [19:55:29] James_F: I'd like to test if that's OK [19:55:55] Tchanders: Sure. Live on mwdebug1001. [19:56:22] James_F: All good - thanks! [19:56:33] Syncing. [19:57:32] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable partial blocks on last wiki, Commons T242570 (duration: 01m 03s) [19:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:36] SWAT done, with 150 seconds to spare. [19:57:40] Krinkle: All yours. [19:57:51] OK [20:00:05] liw and brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T2000). 
[20:01:41] * Krinkle testing on mwdebug1002 [20:02:40] (PS1) Reedy: Leave a comment that wmfPhabricatorApiToken belongs to PhabBanBot [puppet] - https://gerrit.wikimedia.org/r/565098 [20:02:42] (PS1) Reedy: Remove old unused OpenStackManager variables [puppet] - https://gerrit.wikimedia.org/r/565099 [20:05:16] Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Cmjohnson) [20:09:10] Well, I'm not able to verify the bug as fixed. That's odd. [20:10:13] (CR) Papaul: [C: +2] Partman: Fix typo on puppetmaster* [puppet] - https://gerrit.wikimedia.org/r/565097 (https://phabricator.wikimedia.org/T239732) (owner: Papaul) [20:10:36] Krinkle: The RL language fallback error? [20:10:49] James_F MMV firefox [20:10:54] Oh, that one. [20:11:45] Operations, ops-codfw, fundraising-tech-ops, Patch-For-Review: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (Papaul) [20:12:33] Operations, ops-codfw, fundraising-tech-ops: rack/setup/install frlog2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T242265 (Papaul) a:Papaul→Jgreen @Jgreen All yours let me know if you have any questions [20:12:41] * Krinkle tries a few different ways [20:12:44] (PS2) BryanDavis: Leave a comment that wmfPhabricatorApiToken belongs to PhabBanBot [puppet] - https://gerrit.wikimedia.org/r/565098 (owner: Reedy) [20:14:21] (CR) BryanDavis: [C: +1] "Thanks for starting this patch Reedy. I added a bit more info to the new comment and another one about the related gerrit credentials." 
[puppet] - https://gerrit.wikimedia.org/r/565098 (owner: Reedy) [20:17:01] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.15/extensions/MultimediaViewer/resources/: T229484 (duration: 01m 06s) [20:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:04] T229484: After closing, browser sometimes scrolls to the top of the page in Firefox 70 - https://phabricator.wikimedia.org/T229484 [20:20:18] Operations, ORES, Scoring-platform-team: ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (Dzahn) p:Triage→Normal [20:21:03] Operations, ORES, Scoring-platform-team (Current): Ores celery OOM event in codfw - https://phabricator.wikimedia.org/T242705 (Halfak) [20:26:33] Krinkle: Mind syncing a touch on IS.php? [20:26:41] (Or I can.) [20:27:07] Go ahead [20:27:15] !log jforrester@deploy1001 sync-file aborted: Enable partial blocks on last wiki, (duration: 00m 01s) [20:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:24] Bah, mis-pressed return. [20:28:48] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touched IS.php for sync (duration: 01m 05s) [20:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:52] mediawiki fights against pbs [20:28:53] Clear. [20:29:11] but James_F|Away punches harder [20:29:41] * James_F|Away grins. [20:29:51] Operations, ORES, Scoring-platform-team: ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (Halfak) Looks like the OOM error might have been old. Here's what I have now: ` $ sudo -u www-data ../venv/bin/python ores_celery.py /srv/ores/venv/lib/python3.5/site-package... [20:29:57] Krinkle: Did you want to sling out https://gerrit.wikimedia.org/r/c/565072/ for the mrj thing? 
[20:30:53] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (Papaul) [20:31:53] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (Papaul) @herron hey since @jbond is out do you want to take over this task? [20:32:14] James_F|Away Unlikely today for me, but would be nice yeah. Task has a repro for verification. [20:32:57] Yeah, I manually confirmed. OK, will do it myself now. [20:34:47] Operations, ops-codfw, Core Platform Team: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (Papaul) [20:36:20] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (herron) Hey @Papaul, I don't think there is any specific urgency to this and it can wait until he's back, but if it needs to go sooner I could work on it. [20:43:47] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (Papaul) @herron thanks in that case you can just add the server to site.pp with the role ( spare::system) and assign the task to @jbond [20:47:58] !log gerrit - adding Zoranzoki to members of extension-GoogleAdSense (endorsed by extension owner Siebrand) (T241509) [20:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:02] T241509: +2 for Zoranzoki21 in mediawiki/extensions/GoogleAdSense - https://phabricator.wikimedia.org/T241509 [20:49:35] who is in charge of CirrusSearch? [20:49:52] there's a sethload of cirrussearch-related stuff on logstash-beta [20:51:35] hauskatze: #wikimedia-discovery (me and others). On logstash-beta searching for 'cirrussearch' only reports 5-10 events per 30 minutes. 
Which search are you doing? [20:52:18] hi ebernhardson, thanks for the answer. I'm filtering for all events whose level != info, debug or notice [20:52:51] elasticsearch is yours as well? [20:52:53] [_field_stats] endpoint is deprecated! Use [_field_caps] instead or run a min/max aggregations on the desired fields. [20:52:58] some of it [20:53:25] Search backend error during full_text search for {redacted} after 2: index_not_found_exception: no such index [20:53:27] (PS1) Herron: assign puppetmaster2003 role::spare::system [puppet] - https://gerrit.wikimedia.org/r/565112 (https://phabricator.wikimedia.org/T239732) [20:53:37] (PS1) Jhedden: labs prometheus: only bind localhost and update vhost config [puppet] - https://gerrit.wikimedia.org/r/565113 (https://phabricator.wikimedia.org/T242460) [20:53:39] Krinkle: Hmm. I'm getting a 500 error on mwdebug1001 from it instead? But it works in wmf.15… https://en.wikipedia.org/w/load.php?lang=mrj&modules=startup&only=scripts&raw=1&skin=vector&uselang=en&jemimaa=2 [20:53:43] hauskatze: suggests someone created a wiki without successfully creating the index in addWiki.php [20:54:15] addWiki for Beta Cluster? [20:54:24] oh, addWiki is b0rken [20:54:25] James_F|Away: i'm randomly guessing how people add wikis to beta :) [20:54:31] * hauskatze did not [20:54:39] ebernhardson: "Not well". ;-) [20:54:46] addWiki is fixed for production. [20:54:48] there's also evenlogging stuff [20:54:54] As of a few months ago. 
[20:54:57] ^ [20:56:23] (PS4) saper: Wikistats v2 need no symbolic link [puppet] - https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) [20:56:31] You can just re-run the index creation parts of the script [20:56:36] i'm running them now [20:58:20] (PS2) Jhedden: labs prometheus: only bind localhost and update vhost config [puppet] - https://gerrit.wikimedia.org/r/565113 (https://phabricator.wikimedia.org/T242460) [21:00:04] cscott, arlolra, subbu, halfak, and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200115T2100). [21:00:20] (CR) Herron: [C: +2] assign puppetmaster2003 role::spare::system [puppet] - https://gerrit.wikimedia.org/r/565112 (https://phabricator.wikimedia.org/T239732) (owner: Herron) [21:00:29] (CR) Jhedden: [C: +2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/20376/" [puppet] - https://gerrit.wikimedia.org/r/565113 (https://phabricator.wikimedia.org/T242460) (owner: Jhedden) [21:00:48] (PS1) Eevans: Echo: remove transition echo seen-time storage [mediawiki-config] - https://gerrit.wikimedia.org/r/565120 (https://phabricator.wikimedia.org/T234963) [21:01:36] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (kzimmerman) Approved as Jennifer's manager! [21:03:07] Operations, ops-codfw, Patch-For-Review: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (herron) a:Papaul→jbond >>! In T239732#5807275, @Papaul wrote: > @herron thanks in that case you can just add the server to s... 
[21:03:37] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.14/languages/messages/MessagesMrj.php: Fix fallbacks of mrj (Hill Mari) T242409 T242796 (duration: 01m 05s) [21:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:42] T242796: Fatal TypeError: Return of ResourceLoaderImage::getLangFallbacks() must be array (on mrj.wikipedia.org) - https://phabricator.wikimedia.org/T242796 [21:03:42] T242409: languageinfo API returns a TypeError if you request fallbacks - https://phabricator.wikimedia.org/T242409 [21:04:05] (PS2) Eevans: Echo: remove transition echo seen-time storage [mediawiki-config] - https://gerrit.wikimedia.org/r/565120 (https://phabricator.wikimedia.org/T234963) [21:08:45] (PS1) Jhedden: labs prometheus: convert apache config to template [puppet] - https://gerrit.wikimedia.org/r/565125 (https://phabricator.wikimedia.org/T242460) [21:09:43] (PS5) saper: Wikistats v2 need no symbolic link [puppet] - https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) [21:09:52] (PS2) Jhedden: labs prometheus: convert apache config to template [puppet] - https://gerrit.wikimedia.org/r/565125 (https://phabricator.wikimedia.org/T242460) [21:10:30] (CR) Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/20377/" [puppet] - https://gerrit.wikimedia.org/r/565125 (https://phabricator.wikimedia.org/T242460) (owner: Jhedden) [21:12:43] (CR) Jhedden: [C: +2] labs prometheus: convert apache config to template [puppet] - https://gerrit.wikimedia.org/r/565125 (https://phabricator.wikimedia.org/T242460) (owner: Jhedden) [21:13:47] (PS3) Jhedden: labs prometheus: convert apache config to template [puppet] - https://gerrit.wikimedia.org/r/565125 (https://phabricator.wikimedia.org/T242460) [21:22:36] (CR) saper: "Thank you @Elukey for the comments - give the @variable model a try." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/564739 (https://phabricator.wikimedia.org/T237752) (owner: saper) [21:24:33] (PS3) saper: Wikistats v2: go live [puppet] - https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) [21:26:56] Operations, ops-codfw: (No Need By Date Provided) codfw: rack/setup/install puppetmaster2003.codfw.wmnet - https://phabricator.wikimedia.org/T239732 (Papaul) @herron thanks. [21:31:14] Operations, ORES, Scoring-platform-team (Current): ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (Halfak) [21:33:02] Operations, ops-codfw, Core Platform Team: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (Papaul) [21:34:34] Operations, ops-codfw, Core Platform Team: (No Need By Date Provided) rack/setup/install restbase202[123] - https://phabricator.wikimedia.org/T241790 (Papaul) a:Papaul→fgiunchedi @fgiunchedi all yours [21:39:33] (CR) saper: "Fixed the tab problem, thanks!" 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/564745 (https://phabricator.wikimedia.org/T237752) (owner: saper) [21:46:03] (PS3) Bstorm: Leave a comment that wmfPhabricatorApiToken belongs to PhabBanBot [puppet] - https://gerrit.wikimedia.org/r/565098 (owner: Reedy) [21:47:17] (CR) Bstorm: [C: +2] Leave a comment that wmfPhabricatorApiToken belongs to PhabBanBot [puppet] - https://gerrit.wikimedia.org/r/565098 (owner: Reedy) [21:50:36] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (Nuria) Approved on my end [21:53:31] (PS2) Reedy: Remove old unused OpenStackManager variables [puppet] - https://gerrit.wikimedia.org/r/565099 [21:58:05] (CR) Bstorm: k8s: Don't restart all k8s machinery to reboot a basic webservice (2 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/563624 (https://phabricator.wikimedia.org/T228499) (owner: Bstorm) [22:08:59] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 52394392 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:10:37] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11920 and 25 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:23:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:24:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:27:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:28:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:35:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:37:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:40:15] !log phabricator - disabling 'bzimport' user (T242860) [22:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:18] T242860: bzimport uses deprecated certificate auth - https://phabricator.wikimedia.org/T242860 [22:49:47] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for jennifer wang (jwang) - https://phabricator.wikimedia.org/T242807 (jwang) @MoritzMuehlenhoff Thanks for the instruction. Here are the updated application information. Thanks! wikitech username: Jenn... [22:52:36] Operations, ORES, Scoring-platform-team (Current): ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (Halfak) Turns out it was the statsd host. It changes from labsmon1001 to cloudmetrics1001. 
Now that I've done a new deployment with an updated config, we're back onli... [23:07:21] (CR) Dzahn: "we are in the process of replacing some phabricator instances on stretch in cloud VPS, just needs a tiny bit more time" [puppet] - https://gerrit.wikimedia.org/r/563469 (owner: Muehlenhoff) [23:31:32] (CR) Ppchelko: "The image is now being built and is ready, see Ic3cb4deceb5bebd890011b24a24016c45f33ddcf (sorry for the mess over there, I've had some une" [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [23:36:14] (CR) Ppchelko: "Is it possible to rename `changepropagation` to `changeprop`? It's been called changeprop historically, it's deployed as changeprop, the m" [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [23:51:15] (CR) Ppchelko: "The config.yaml is missing isn't it? Or I misunderstand something?" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/554576 (owner: Holger Knust) [23:53:34] (PS1) Catrope: GrowthExperiments: Enable topics for suggested edits on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565154 [23:53:52] Operations, ORES, Scoring-platform-team (Current): ores.wmflabs.org - 503 icinga alerts - https://phabricator.wikimedia.org/T242819 (Dzahn) Open→Resolved {F31513810, size=full} [23:55:49] (PS2) Catrope: GrowthExperiments: Enable topic search, behind a hidden preference [mediawiki-config] - https://gerrit.wikimedia.org/r/564183 (https://phabricator.wikimedia.org/T242698)