[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for an Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181116T0000).
[00:00:04] stephanebisson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:08] hi! anyone know who will SWAT?
[00:01:47] Hi, I can swat
[00:03:06] stephanebisson: I am unexpectedly here and can help out if you want me to
[00:08:50] (PS1) Dzahn: mediawiki_maintenance: reverse rsync direction of /home [puppet] - https://gerrit.wikimedia.org/r/473956
[00:09:36] (PS2) Dzahn: mediawiki_maintenance: reverse rsync direction of /home [puppet] - https://gerrit.wikimedia.org/r/473956
[00:10:58] (CR) Lucas Werkmeister (WMDE): "Thanks! But I’m not sure if my text makes sense as a stand-alone comment here, when it was originally written as a reply to something… may" [mediawiki-config] - https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: Niedzielski)
[00:11:32] (CR) Lucas Werkmeister (WMDE): "(What’s missing from that note is why HTTP was chosen in the first place – that’s something I honestly just don’t know.)" [mediawiki-config] - https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: Niedzielski)
[00:12:54] (CR) Dzahn: [C: 2] mediawiki_maintenance: reverse rsync direction of /home [puppet] - https://gerrit.wikimedia.org/r/473956 (owner: Dzahn)
[00:14:07] PROBLEM - Disk space on analytics1029 is CRITICAL: DISK CRITICAL - free space: / 2045 MB (3% inode=96%)
[00:15:37] !log sbisson@deploy1001 Started scap: php-1.33.0-wmf.4/extensions/GrowthExperiments SWAT: [[gerrit:473843]] [[gerrit:473844]] [[gerrit:473845]]
[00:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:59] (CR) Lucas Werkmeister (WMDE): [C: 1] "In this case I actually think we could go for HTTPS :) stability is not a concern here (because we’re changing the base URI anyways, and a" [mediawiki-config] - https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: Niedzielski)
[00:19:43] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:20:37] SMalyshev: we have mwmaint2001 and i see you have scripts there and i am syncing files from 2001 to 1002 now. i hope that helps?
[00:21:04] RoanKattouw: do you still need files that are for this ticket https://phabricator.wikimedia.org/T178313
[00:21:19] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:21:25] there's a directory that's pretty large and is named after that
[00:21:53] !log sbisson@deploy1001 sync aborted: php-1.33.0-wmf.4/extensions/GrowthExperiments SWAT: [[gerrit:473843]] [[gerrit:473844]] [[gerrit:473845]] (duration: 06m 16s)
[00:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:57] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/GrowthExperiments: SWAT: [[gerrit:473843]] [[gerrit:473844]] [[gerrit:473845]] (duration: 00m 49s)
[00:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:53] PROBLEM - Disk space on analytics1029 is CRITICAL: DISK CRITICAL - free space: / 2033 MB (3% inode=96%)
[00:35:09] !log sbisson@deploy1001 Synchronized php-1.33.0-wmf.4/extensions/WikimediaEvents/includes/PageViews.php: SWAT: [[gerrit:473861|Exclude users where getRegistration() returns null]] (duration: 00m 47s)
[00:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:35] Operations, netops: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (ayounsi)
[00:36:10] SMalyshev: check your home again now, i copied the files over from 2001 for your home
[00:36:44] mutante: thanks, but those are only a small part of what was there before
[00:37:06] ah wait maybe they are in home-terbium
[00:37:20] I'm done SWATing.
[00:37:25] like 430M or so?
[00:37:25] yeah they are there, that's good then
[00:37:48] :) great, i'm glad
[00:38:11] hmm not supposed to be so much... I guess I have to clean up
[00:38:18] I only need scripts which are small
[00:38:34] dont worry, others have many times that size
[00:39:23] oh of course there's a ton of logs there
[00:39:40] the terbium stuff has moved across more than one migration
[00:40:05] and kind of with the DC switch
[00:40:35] so what is cool is to move stuff out of "home-terbium" into the normal home if still used
[00:40:47] or remove it if not used
[00:41:31] ok, I'll sort it out, thanks!
[00:41:44] :)
[00:42:33] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[00:45:40] Trey314159: the restore has problems, BUT.. we still have mwmaint2001 and i copied your files from there over to 1002 directly into your home. i hope that helped.. though i'm afraid it might not be all
[00:46:02] Trey314159: the "reindex" dir is there and also the dot files
[00:46:47] the .sh files might be the status from a little while ago
[00:55:03] !log some users reported missing files in home dirs on mwmaint1002, reversed rsyncd/ferm setup and rsynced /home from mwmaint2001 to /root on mwmaint1002, restored individually where requested, rsync is not fully automatic but puppetized with rsync::quickdatacopy
[00:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:53] RECOVERY - Disk space on analytics1029 is OK: DISK OK
[01:12:06] Operations, netops: Investigate network issues in codfw that caused 503 errors - https://phabricator.wikimedia.org/T209145 (ayounsi) a:ayounsi Not answering the question but adding data points. We periodically have similar links flap or cut, and this one doesn't seem different from the others. Lo...
[01:24:04] Urbanecm: the new Wiktionary is "yue" but no Wikipedia "yue" exists, only "zh-yue" ? hmm
[01:24:55] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:25:12] normally there is always a matching wikipedia once a lang exists
[01:26:27] https://phabricator.wikimedia.org/T105999 .. ok.. hmm
[01:26:58] T10217 sigh.. got it
[01:26:59] T10217: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) - https://phabricator.wikimedia.org/T10217
[01:41:45] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[02:10:41] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:11:47] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[02:28:29] (CR) Gergő Tisza: "> Maybe we could have a generic DisabledSpecialPage which just displays a message" [mediawiki-config] - https://gerrit.wikimedia.org/r/470956 (https://phabricator.wikimedia.org/T208083) (owner: Tim Starling)
[02:47:06] (PS1) Herron: profile::rsyslog::kafka_shipper: use eqiad logging kafka brokers [puppet] - https://gerrit.wikimedia.org/r/473998 (https://phabricator.wikimedia.org/T206454)
[02:47:37] (PS2) Herron: peopleweb: include rsyslog kafka_shipper [puppet] - https://gerrit.wikimedia.org/r/473847 (https://phabricator.wikimedia.org/T205852)
[02:50:56] (CR) Herron: [C: 2] profile::rsyslog::kafka_shipper: use eqiad logging kafka brokers [puppet] - https://gerrit.wikimedia.org/r/473998 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[03:01:30] (CR) Herron: [C: 2] peopleweb: include rsyslog kafka_shipper [puppet] - https://gerrit.wikimedia.org/r/473847 (https://phabricator.wikimedia.org/T205852) (owner: Herron)
[03:36:15] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 888.43 seconds
[03:43:13] (PS1) Herron: logstash::input::kafka: add codec param [puppet] - https://gerrit.wikimedia.org/r/474021 (https://phabricator.wikimedia.org/T206454)
[03:48:29] PROBLEM - Disk space on analytics1029 is CRITICAL: DISK CRITICAL - free space: / 2095 MB (3% inode=96%)
[03:50:32] (CR) Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/13531/" [puppet] - https://gerrit.wikimedia.org/r/474021 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[03:54:57] (CR) Herron: [C: 2] "self-merging this since the approach is borrowed from logstash::input::tcp and per https://www.elastic.co/guide/en/logstash/current/plugin" [puppet] - https://gerrit.wikimedia.org/r/474021 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[04:09:43] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.41 seconds
[04:11:55] PROBLEM - Disk space on analytics1029 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied
[04:19:14] (PS1) Herron: profile::logstash::collector: set kafka shipper input codec to json [puppet] - https://gerrit.wikimedia.org/r/474026 (https://phabricator.wikimedia.org/T206454)
[04:22:18] (CR) Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/13532/" [puppet] - https://gerrit.wikimedia.org/r/474026 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[04:22:39] (CR) Herron: [C: 2] profile::logstash::collector: set kafka shipper input codec to json [puppet] - https://gerrit.wikimedia.org/r/474026 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[04:33:53] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:59:25] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:02:41] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:11:37] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[05:47:46] !log uploaded certcentral 0.7 to apt.wikimedia.org (stretch) - T208967 T209475
[05:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:51] T208967: Avoid using acme.client poll_and_finalize() method - https://phabricator.wikimedia.org/T208967
[05:47:51] T209475: store non-config files in /var/lib/certcentral - https://phabricator.wikimedia.org/T209475
[05:57:38] (PS1) Vgutierrez: certcentral: update live_certs path [puppet] - https://gerrit.wikimedia.org/r/474058 (https://phabricator.wikimedia.org/T209475)
[05:59:34] (CR) Vgutierrez: [C: 2] certcentral: update live_certs path [puppet] - https://gerrit.wikimedia.org/r/474058 (https://phabricator.wikimedia.org/T209475) (owner: Vgutierrez)
[06:27:40] Operations, ops-eqiad, Analytics, User-Elukey: rack/setup/install dbstore100[3-5].eqiad.wmnet - https://phabricator.wikimedia.org/T209620 (Marostegui) Even though the name `dbstore1001` suggests otherwise....this host has nothing to do with the future usage for these dbstore1003-5 hosts, so it can...
[06:28:58] !log Set sync_binlog=0 and trx_commit=2 on dbstore2002:3313 to let it catch up
[06:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:53] (PS1) Marostegui: db-codfw.php: Pool pc2009 on pc3 [mediawiki-config] - https://gerrit.wikimedia.org/r/474073 (https://phabricator.wikimedia.org/T208383)
[06:31:18] Operations, ops-eqiad, Analytics, User-Elukey: rack/setup/install dbstore100[3-5].eqiad.wmnet - https://phabricator.wikimedia.org/T209620 (elukey) a:elukey>Cmjohnson >>! In T209620#4752864, @Marostegui wrote: > Even though the name `dbstore1001` suggests otherwise....this host has nothing t...
[06:32:12] (PS1) Marostegui: pc2009.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/474074 (https://phabricator.wikimedia.org/T208383)
[06:33:26] (CR) Marostegui: [C: 2] db-codfw.php: Pool pc2009 on pc3 [mediawiki-config] - https://gerrit.wikimedia.org/r/474073 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:33:39] (CR) Marostegui: [C: 2] pc2009.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/474074 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:35:31] (Merged) jenkins-bot: db-codfw.php: Pool pc2009 on pc3 [mediawiki-config] - https://gerrit.wikimedia.org/r/474073 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:36:43] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool pc2009 in pc3 - T208383 (duration: 00m 56s)
[06:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:46] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383
[06:37:12] Operations, DBA, Patch-For-Review, User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Marostegui) pc2009 has been pooled in. I am going to leave the weekend go by before starting the decommission proces...
[06:37:24] Operations, DBA, Patch-For-Review, User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (Marostegui)
[06:37:59] (CR) jenkins-bot: db-codfw.php: Pool pc2009 on pc3 [mediawiki-config] - https://gerrit.wikimedia.org/r/474073 (https://phabricator.wikimedia.org/T208383) (owner: Marostegui)
[06:38:07] PROBLEM - Hadoop NodeManager on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:41:39] checking --^
[06:51:57] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:00:36] (PS1) CRusnov: Make the puppetdb backend process primitive types for queries. [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037)
[07:03:32] (CR) CRusnov: [C: -1] "There is still a testing failure in part of the tests that doesn't seem to have anything to do with my changes so i'm a bit confused. Othe" [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[07:03:44] (CR) jerkins-bot: [V: -1] Make the puppetdb backend process primitive types for queries. [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[07:11:59] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[07:16:33] mutante, on T205546 it was explicitly requested by LangCom clerk (StevenJ81) that this project should be created on yue.wiktionary.org, so I did so
[07:16:34] T205546: Create Wiktionary Cantonese - https://phabricator.wikimedia.org/T205546
[07:27:53] (PS1) Elukey: Avoid using /var/lib/hadoop/data/l on an1029 due to a broken disk [puppet] - https://gerrit.wikimedia.org/r/474094
[07:28:41] (CR) Elukey: [C: 2] Avoid using /var/lib/hadoop/data/l on an1029 due to a broken disk [puppet] - https://gerrit.wikimedia.org/r/474094 (owner: Elukey)
[07:29:25] RECOVERY - Hadoop NodeManager on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[07:29:29] RECOVERY - Disk space on analytics1029 is OK: DISK OK
[07:32:00] !log forced reboot + fsck + removal of /var/lib/hadoop/data/l from fstab on analytics1029
[07:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:55] (PS2) Muehlenhoff: Share keytabs directory for multiple services [puppet] - https://gerrit.wikimedia.org/r/473760
[08:09:53] (CR) Muehlenhoff: [C: 2] Share keytabs directory for multiple services [puppet] - https://gerrit.wikimedia.org/r/473760 (owner: Muehlenhoff)
[08:10:12] Operations, Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (Joe) >>! In T206341#4751919, @Gilles wrote: > Have you diffed the output coming from HHVM and PHP7, to ensure that they're generating the same HTML for these pages? Ou...
[08:16:47] Operations, Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (Joe) Forcing a reparse of the Obama page by requesting `curl -g -b "PHP_ENGINE=php7" -H 'Host: en.wikipedia.org' 'http://mw1261.eqiad.wmnet/w/api.php?action=parse&tex...
[08:22:01] (PS1) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[08:23:50] Operations, ops-eqiad, media-storage: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (fgiunchedi) >>! In T209618#4751090, @RobH wrote: > Well, netbox makes reviewing current rack placement easier, since it summarizes it: > > https://netbox.wikimedia.org/dci...
[08:24:22] Operations, ops-eqiad, media-storage: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (fgiunchedi) a:fgiunchedi>RobH
[08:24:38] (PS2) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[08:27:16] (PS3) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[08:31:13] (CR) Elukey: "changes are good https://puppet-compiler.wmflabs.org/compiler1002/13535/ but it seems that those extra carriage returns are really annoyin" [puppet/cdh] - https://gerrit.wikimedia.org/r/474113 (owner: Elukey)
[08:36:01] <_joe_> elukey: you can fix those with careful usage of "-" in templates
[08:37:38] (CR) DCausse: "thanks!" [software/spicerack] - https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[08:37:48] (CR) DCausse: [C: 1] Add administrative module [software/spicerack] - https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[08:38:33] _joe_ yep I was about to do so :)
[08:38:37] _joe_: any idea if you can actually query elastic from grafana? or is the datasource just left there hanging around / legacy ?
[08:39:09] <_joe_> addshore: what datasource is that?
[08:39:15] "elastic"
[08:39:20] <_joe_> you can guess from my answer I have no idea :P
[08:39:41] addshore: isn't it only for adding deployment markups in the graphs?
[08:39:42] i looked through puppet etc, but couldnt actually find the configurations for the datasources
[08:39:56] dcausse: the deployment markings come from graphite too afaik
[08:40:10] (PS2) Urbanecm: Change sitename of shnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/473670 (https://phabricator.wikimedia.org/T206777)
[08:40:27] !log installing curl security updates on jessie
[08:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:41] addshore: were you able to identify which elastic hosts it tried to connect to?
[08:40:54] dcausse: https://github.com/wikimedia/puppet/blob/production/modules/grafana/files/dashboards/varnish-aggregate-client-status-codes#L5-L9 << example deploy marking
[08:41:11] dcausse: nope, all I see is "elastic" in the frontend
[08:41:47] in the past grafana used to store its dashboard in elastic IIRC, not sure if it's still the case?
[08:41:52] https://usercontent.irccloud-cdn.com/file/r8gBZxDZ/image.png
[08:41:59] aaah, maybe that is what it is left over from?
[08:42:10] no clue :/
[08:42:36] any ideas where the datasources for grafana are actually configured
[08:42:37] ?
[08:42:50] but I'm not aware of any elastic cluster that holds metrics data that could be used in grafana
[08:42:52] I seem to remember it might just be in the UI :/
[08:43:18] dcausse: well, the logstash one could be as far as I know? (at least I guess..)
[08:43:34] (PS1) Urbanecm: Remove wgMetaNamespaceTalk for shnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/474124 (https://phabricator.wikimedia.org/T206777)
[08:43:52] seems dangerous to expose logstash data through grafana I think
[08:44:30] indeed, perhaps
[08:45:17] thanks for the help :)
[08:45:28] yw :)
[08:45:31] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:45:51] I know it's Friday, but https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/474124 should fix a very annoying error, because wgMetaNamespace==wgMetaNamespaceTalk for shnwiki (yeah, same values). This causes MW to be confused. Do you think it can be deployed now?
[08:45:56] (hi everyone, btw)
[08:45:59] (PS4) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[08:57:27] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban), cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (aborrero)
[09:00:16] (PS1) Filippo Giunchedi: statsite: move from role to profile [puppet] - https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870)
[09:00:18] (PS5) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[09:00:32] Operations, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, Release-Engineering-Team (Kanban), cloud-services-team (Kanban): Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (aborrero) >>! In T209642#4752043, @hashar wrote: > #cloud-services-t...
[09:00:40] (CR) jerkins-bot: [V: -1] statsite: move from role to profile [puppet] - https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi)
[09:02:33] (PS2) Filippo Giunchedi: statsite: move from role to profile [puppet] - https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870)
[09:02:54] (CR) jerkins-bot: [V: -1] statsite: move from role to profile [puppet] - https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi)
[09:04:54] (PS3) Filippo Giunchedi: statsite: move from role to profile [puppet] - https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870)
[09:04:56] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/13538/" [puppet] - https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870) (owner: Filippo Giunchedi)
[09:05:05] (CR) DCausse: [C: 1] Add Puppet module [software/spicerack] - https://gerrit.wikimedia.org/r/473735 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[09:06:36] (CR) DCausse: [C: 1] Add Icinga module [software/spicerack] - https://gerrit.wikimedia.org/r/473506 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[09:06:55] Operations, decommission, Continuous-Integration-Infrastructure (shipyard), Patch-For-Review, and 2 others: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (MoritzMuehlenhoff)
[09:07:04] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (jijiki) a:jijiki>None
[09:11:01] (PS1) Alexandros Kosiaris: kubernetes: Move runtime-config to hiera [puppet] - https://gerrit.wikimedia.org/r/474127
[09:11:07] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational
[09:12:06] (CR) jerkins-bot: [V: -1] kubernetes: Move runtime-config to hiera [puppet] - https://gerrit.wikimedia.org/r/474127 (owner: Alexandros Kosiaris)
[09:14:10] (PS1) Filippo Giunchedi: ci: use statsite for localhost statsd aggregation [puppet] - https://gerrit.wikimedia.org/r/474128 (https://phabricator.wikimedia.org/T205870)
[09:19:37] (PS1) Filippo Giunchedi: prometheus: ignore all rsyslog actions with default names [puppet] - https://gerrit.wikimedia.org/r/474133
[09:20:43] (CR) Gehel: "A few style comments inline" (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: CRusnov)
[09:20:58] (CR) Filippo Giunchedi: [C: 2] prometheus: ignore all rsyslog actions with default names [puppet] - https://gerrit.wikimedia.org/r/474133 (owner: Filippo Giunchedi)
[09:21:31] !log removed labvirt1016 from debmonitor db, got renamed to cloudvirt1016
[09:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:19] (CR) DCausse: [C: 1] elasticsearch: create multiple elasticsearch instances on cirrus codfw [puppet] - https://gerrit.wikimedia.org/r/473258 (https://phabricator.wikimedia.org/T207918) (owner: Gehel)
[09:25:47] !log installing postgres updates on labsdb1006
[09:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:37] (CR) DCausse: elasticsearch: allow filtering instances to be deployed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: Gehel)
[09:31:11] !log Set back sync_binlog=1 and trx_commit=1 after dbstore2002:3313 has caught up
[09:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:06] (CR) Gehel: [C: -1] "very minor comments inline" (4 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/473796 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[10:12:03] (PS6) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[10:13:48] (PS1) Filippo Giunchedi: grafana: deprecate Diamond metrics in swift dashboard [puppet] - https://gerrit.wikimedia.org/r/474138 (https://phabricator.wikimedia.org/T183454)
[10:15:40] (PS7) Elukey: Introduce new security directives for Yarn/HDFS/MapReduce [puppet/cdh] - https://gerrit.wikimedia.org/r/474113
[10:17:53] (PS2) ArielGlenn: add twentyafterfour to releasers-mediawiki [puppet] - https://gerrit.wikimedia.org/r/472951 (https://phabricator.wikimedia.org/T209176)
[10:18:12] (CR) jerkins-bot: [V: -1] add twentyafterfour to releasers-mediawiki [puppet] - https://gerrit.wikimedia.org/r/472951 (https://phabricator.wikimedia.org/T209176) (owner: ArielGlenn)
[10:20:10] mv: cannot stat '/srv/workspace/puppet/.tox/log/*' that's a new one
[10:20:38] (CR) Elukey: "Better now https://puppet-compiler.wmflabs.org/compiler1002/13540/" [puppet/cdh] - https://gerrit.wikimedia.org/r/474113 (owner: Elukey)
[10:24:33] (CR) Filippo Giunchedi: [C: 2] memcached: remove diamond::collector resource [puppet] - https://gerrit.wikimedia.org/r/469258 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[10:26:46] (CR) Muehlenhoff: [C: 1] memcached: remove diamond::collector resource [puppet] - https://gerrit.wikimedia.org/r/469258 (https://phabricator.wikimedia.org/T183454) (owner: Cwhite)
[10:28:01] (Abandoned) Filippo Giunchedi: Test for unreferenced files introduced by changes [software/puppet-compiler] - https://gerrit.wikimedia.org/r/354939 (owner: Filippo Giunchedi)
[10:29:01] (Abandoned) Filippo Giunchedi: svc: add graphite LVS addresses [dns] - https://gerrit.wikimedia.org/r/289635 (https://phabricator.wikimedia.org/T85451) (owner: Filippo Giunchedi)
[10:29:09] (Abandoned) Filippo Giunchedi: graphite: add realserver class [puppet] - https://gerrit.wikimedia.org/r/289637 (https://phabricator.wikimedia.org/T85451) (owner: Filippo Giunchedi)
[10:29:19] (Abandoned) Filippo Giunchedi: graphite: add multiple clusters per carbon-c-relay route [puppet] - https://gerrit.wikimedia.org/r/289211 (owner: Filippo Giunchedi)
[10:30:15] (CR) ArielGlenn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/472951 (https://phabricator.wikimedia.org/T209176) (owner: ArielGlenn)
[10:31:35] (CR) ArielGlenn: [C: 2] add twentyafterfour to releasers-mediawiki [puppet] - https://gerrit.wikimedia.org/r/472951 (https://phabricator.wikimedia.org/T209176) (owner: ArielGlenn)
[10:31:38] (Abandoned) Filippo Giunchedi: nutcracker: listen on localhost for stats [puppet] - https://gerrit.wikimedia.org/r/324642 (https://phabricator.wikimedia.org/T111934) (owner: Filippo Giunchedi)
[10:32:59] Operations, OTRS: Upgrade to OTRS version 5.0.32 - https://phabricator.wikimedia.org/T209691 (akosiaris)
[10:33:05] Operations, SRE-Access-Requests, Patch-For-Review: Add Mukunda to releasers-mediawiki - https://phabricator.wikimedia.org/T209176 (ArielGlenn) In half an hour or so this will be live everywhere and you can check that it's working.
[10:39:00] !log upgrade OTRS to 5.0.32 T209691
[10:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:04] T209691: Upgrade to OTRS version 5.0.32 - https://phabricator.wikimedia.org/T209691
[10:40:05] (Abandoned) Ema: cacheproxy: only call cron_splay() for hosts in $all_nodes [puppet] - https://gerrit.wikimedia.org/r/472384 (https://phabricator.wikimedia.org/T208588) (owner: Ema)
[10:48:11] (PS1) Ema: Drop 0011-logging-broken-pipe-no-spam.patch [debs/trafficserver] - https://gerrit.wikimedia.org/r/474146 (https://phabricator.wikimedia.org/T204225)
[10:50:00] Operations, OTRS: Upgrade to OTRS version 5.0.32 - https://phabricator.wikimedia.org/T209691 (akosiaris) Open>Resolved p:Triage>Normal Upgrade done, resolving
[10:56:37] Operations: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (Urbanecm)
[11:00:50] Operations, monitoring: Update prometheus-node-exporter NTP metrics - https://phabricator.wikimedia.org/T208875 (fgiunchedi) Indeed, upgrading node-exporter from 0.14 to 0.16 at least would entail making some changes to e.g. command line flags and cater for the renamed metrics e.g. https://github.com/pro...
[11:02:38] (PS1) Ema: Add 0009-verify-config-segfault.patch [debs/trafficserver] - https://gerrit.wikimedia.org/r/474151 (https://phabricator.wikimedia.org/T204209)
[11:07:54] Operations, Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (ArielGlenn)
[11:08:55] !log kartik@deploy1001 Started deploy [cxserver/deploy@473b0de]: Update cxserver to b7cdb26 (T208831, T203077, T203160, T206777)
[11:08:57] Operations, Puppet, User-Joe: Puppet4: hiera() can only be called using the 4.x function API. - https://phabricator.wikimedia.org/T179181 (ArielGlenn)
[11:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:03] T208831: Make Apertium tests independent of Labs service - https://phabricator.wikimedia.org/T208831
[11:09:03] T203160: CX2: Highlight (and skip) references with a template that could not be adapted - https://phabricator.wikimedia.org/T203160
[11:09:03] T206777: Create Wikipedia Shan - https://phabricator.wikimedia.org/T206777
[11:09:04] T203077: Performance analysis for translate API - https://phabricator.wikimedia.org/T203077
[11:09:24] Operations, Puppet: Fix regex.yaml single-regex issue - https://phabricator.wikimedia.org/T183565 (ArielGlenn)
[11:10:30] Operations, MediaWiki-Configuration, discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597 (ArielGlenn)
[11:12:15] Operations: dumps: update docs to reflect staged dumps and xml streams - https://phabricator.wikimedia.org/T111018 (ArielGlenn) Open>Resolved This got done years ago. Closing.
[11:13:21] !log kartik@deploy1001 Finished deploy [cxserver/deploy@473b0de]: Update cxserver to b7cdb26 (T208831, T203077, T203160, T206777) (duration: 04m 26s)
[11:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:20] Operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917 (ArielGlenn) This is now dependent on the bandwidth caps for labstore 1006,7. There's a task for that: T191491
[11:14:32] Operations, Datasets-General-or-Unknown: Provide a good download service of dumps from Wikimedia - https://phabricator.wikimedia.org/T122917 (ArielGlenn) p:Triage>Normal
[11:22:18] (PS1) Ema: trafficserver (8.0.0-1wm2) stretch-wikimedia; urgency=medium [debs/trafficserver] - https://gerrit.wikimedia.org/r/474156 (https://phabricator.wikimedia.org/T204209)
[11:23:26] Operations, Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (Urbanecm)
[11:25:41] (PS1) Ladsgroup: ores: Drop old b/c configs [puppet] - https://gerrit.wikimedia.org/r/474157 (https://phabricator.wikimedia.org/T209587)
[11:29:44] Operations, Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (Joe) After more thorough analysis of parsing the Obama page: - At low concurrency, PHP 7.2 thoroughly outperforms HHVM on parsing-heavy jobs - When concurrency is high...
[11:35:02] Operations, Thumbor, Patch-For-Review, Performance-Team (Radar), and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[11:38:15] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:40:06] (03CR) 10Ema: [C: 032] Drop 0011-logging-broken-pipe-no-spam.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474146 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [11:41:35] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [11:43:40] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Alexsdutton) There's another SLO which I think ought to be relevant: ensuring that the da... [11:44:41] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) There is no rewrite rule for zh-yue wiktionary; there is one for yue wikipedia. See line 97: https://gerrit.wikimedia.org/r/plugins/gitiles/oper... [11:45:00] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) p:05Triage>03Normal [11:45:46] 10Operations, 10decommission, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, and 2 others: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10ArielGlenn) p:05Triage>03Normal [11:46:16] (03PS2) 10Ema: Add 0009-verify-config-segfault.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474151 (https://phabricator.wikimedia.org/T204209) [11:49:41] (03CR) 10jerkins-bot: [V: 04-1] Add 0009-verify-config-segfault.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474151 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [11:51:23] (03CR) 10Ema: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474151 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [11:54:28] (03PS1) 10Ladsgroup: ores: Move all of celery configs to 
puppet [puppet] - 10https://gerrit.wikimedia.org/r/474158 (https://phabricator.wikimedia.org/T209587) [11:55:07] (03CR) 10jerkins-bot: [V: 04-1] ores: Move all of celery configs to puppet [puppet] - 10https://gerrit.wikimedia.org/r/474158 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [11:58:45] (03CR) 10Ladsgroup: "please do not merge this yet. I need to get some stuff in it first" [puppet] - 10https://gerrit.wikimedia.org/r/474158 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [12:04:25] (03PS1) 10ArielGlenn: use lbzip2 for recompression of wikidata weeky json dumps [puppet] - 10https://gerrit.wikimedia.org/r/474159 (https://phabricator.wikimedia.org/T206535) [12:06:30] (03CR) 10Alexandros Kosiaris: [C: 032] ores: Drop old b/c configs [puppet] - 10https://gerrit.wikimedia.org/r/474157 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [12:06:43] 10Operations, 10ops-eqiad, 10Dumps-Generation: Move dumpsdata1001 - https://phabricator.wikimedia.org/T207278 (10ArielGlenn) This move is complete, no? Is there anything left to do before closing? 
[12:06:49] PROBLEM - Host restbase1014 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:31] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:07:37] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:07:41] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:07:51] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:07:57] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:07:59] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:07:59] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:09] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:09] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:21] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:21] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:21] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:21] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:21] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [12:08:59] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [12:09:05] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [12:09:09] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [12:09:19] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [12:09:21] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [12:09:25] I'll take a look at restbase1014 [12:10:21] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [12:10:29] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy 
[12:10:33] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [12:10:34] !log reboot restbase1014, nothing on console [12:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:57] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [12:11:01] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [12:11:11] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [12:11:37] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [12:11:53] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [12:12:17] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [12:13:27] RECOVERY - Host restbase1014 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [12:13:30] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10GTirloni) [12:14:55] PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:15:01] PROBLEM - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:15:22] nothing I can see in the logs, cassandra should recover soon [12:15:49] PROBLEM - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.137 and port 9042: Connection refused [12:15:49] PROBLEM - cassandra-a SSL 10.64.48.135:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:16:11] PROBLEM - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.135 and port 9042: Connection refused [12:16:15] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 
10.64.48.136 and port 9042: Connection refused [12:23:37] RECOVERY - cassandra-a SSL 10.64.48.135:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-a valid until 2020-06-24 13:01:08 +0000 (expires in 586 days) [12:23:55] RECOVERY - cassandra-c SSL 10.64.48.137:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-c valid until 2020-06-24 13:01:10 +0000 (expires in 586 days) [12:23:57] RECOVERY - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is OK: SSL OK - Certificate restbase1014-b valid until 2020-06-24 13:01:09 +0000 (expires in 586 days) [12:24:41] RECOVERY - cassandra-c CQL 10.64.48.137:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.137 port 9042 [12:25:03] RECOVERY - cassandra-a CQL 10.64.48.135:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.135 port 9042 [12:25:07] RECOVERY - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is OK: TCP OK - 0.000 second response time on 10.64.48.136 port 9042 [12:26:05] aaand we're back [12:26:36] it would be interesting to understand though why restbase-dev was also affected [12:35:27] for that matter, why did restbase1014 dying cause all the rest of its cluster to report errors? 
[12:47:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10Dzahn) 05Open>03stalled a:03jlinehan [12:49:44] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10hashar) [13:02:20] (03CR) 10Hashar: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13541/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474128 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:02:35] !log installing spamassassin security update on mendelevium [13:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] godog: hello, we can do the profile::statsite for zuul if you want ( https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474128/ ) puppet compile looks good [13:05:47] (03CR) 10Volans: "Nice! A bunch of comments inline, most are minor nitpicks (those are marked [nitpicks] ;) )." 
(0317 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [13:07:31] (03CR) 10Muehlenhoff: [C: 031] grafana: deprecate Diamond metrics in swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/474138 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [13:09:51] (03PS8) 10Elukey: Introduce new security directives for Yarn/HDFS/MapReduce/Hive [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 [13:11:42] dcausse: so yes, I don't think logstash is configured properly at all, but i just realized you would actually be able to configure the datasource to make the requests in the user browser, and then have the ability to show graphs in grafana using logstash data when the use is already logged into logstash in the browser [13:12:19] that would remove the risk of having grafana do the tunneling to logstash in the backend [13:12:55] which interestingly (depending on cors) means I could run a grafana instance locally to display data for both logstash and grafana on a single dashboard [13:14:59] I think that's what kibana does, it sends json queries directly to elastic and understands elastic response, there no backend layer in kibana iirc [13:17:28] addshore: so yes if you have access to logstash data then you could build a dashboard locally (with proper cors setup) [13:19:21] so yes with the current grafana datasource, the elastic source points to grafana.wikimedia.org/_msearch, which is just not going to work :P [13:25:57] hashar: sure would work for me, does it work for you on a fri? I guess validation is pretty straightforward but will require a zuul restart [13:29:05] godog: yeah I am fine with that. 
Lets push the puppet change then I will restart zuul [13:29:16] the CI pipes are quiet today [13:30:32] hashar: ok, I need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/474125 first [13:30:47] nice [13:33:59] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:36:13] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:38:14] (03PS2) 10Filippo Giunchedi: grafana: deprecate Diamond metrics in swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/474138 (https://phabricator.wikimedia.org/T183454) [13:39:00] (03CR) 10Filippo Giunchedi: [C: 032] grafana: deprecate Diamond metrics in swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/474138 (https://phabricator.wikimedia.org/T183454) (owner: 10Filippo Giunchedi) [13:39:00] [contint1001.wikimedia.org] out: docker.errors.BuildError: The command '/bin/sh -c echo 'Acquire::http::Proxy "http://webproxy.eqiad.wmnet:8080";' > /etc/apt/apt.conf.d/80_proxy && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install --yes gpg wget --no-install-recommends && rm -f /etc/apt/apt.conf.d/80_proxy && apt-get clean && rm -rf /var/lib/apt/lists/* && cd /tmp && wget http://www.apache.org/d [13:39:01] ist/maven/maven-3/3.5.2/binaries/apache-maven-3.5.2-bin.tar.gz && gpg --import /tmp/KEYS && gpg --verify apache-maven-3.5.2-bin.tar.gz.asc && tar -C /opt -zxf apache-maven-3.5.2-bin.tar.gz && apt purge --yes gpg wget && rm -rf ~/.gnupg' returned a non-zero code: 8 [13:39:02] GRRR [13:39:11] wrong chan [13:40:06] (03CR) 10Muehlenhoff: [C: 031] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:40:23] (03PS2) 10Filippo Giunchedi: ci: use statsite for localhost statsd aggregation [puppet] - 10https://gerrit.wikimedia.org/r/474128 (https://phabricator.wikimedia.org/T205870) [13:41:14] (03CR) 10Filippo Giunchedi: [C: 032] statsite: move from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:41:21] (03PS4) 10Filippo Giunchedi: statsite: move from role to profile [puppet] - 10https://gerrit.wikimedia.org/r/474125 (https://phabricator.wikimedia.org/T205870) [13:41:51] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:19] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [13:42:23] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:27] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:42:27] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/news (get In the News content for unsupported language (with aggregated=true)) timed out before a response was received: /{domain}/v1/data/javascript/mobile/pagelib (Get javascript bundle for pa [13:42:27] out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received [13:42:37] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:05] PROBLEM - restbase endpoints health on 
restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [13:44:05] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [13:44:33] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [13:44:39] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [13:45:39] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [13:45:59] PROBLEM - Check whether ferm is active by checking the default input chain on scb1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [13:46:09] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1043 bytes in 0.002 second response time [13:46:43] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [13:46:43] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:46:47] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [13:46:49] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [13:46:54] (03PS1) 10Andrew Bogott: Horizon: move three projects to eqiad1: [puppet] - 10https://gerrit.wikimedia.org/r/474180 (https://phabricator.wikimedia.org/T204745) [13:46:55] RECOVERY - Check whether ferm is active by checking the default input chain on scb1001 is OK: OK ferm input default policy is set [13:46:57] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.003 second response time [13:48:02] (03CR) 10Andrew Bogott: [C: 032] Horizon: move three projects to eqiad1: [puppet] - 10https://gerrit.wikimedia.org/r/474180 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [13:50:11] (03PS3) 10Filippo Giunchedi: ci: use statsite for localhost statsd aggregation [puppet] - 10https://gerrit.wikimedia.org/r/474128 (https://phabricator.wikimedia.org/T205870) [13:50:19] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler1002/13543/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/474128 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:53:03] (03PS1) 10Marostegui: mariadb: Reimage db1118 as mariadb [puppet] - 10https://gerrit.wikimedia.org/r/474182 [13:54:25] (03PS1) 10Alexandros Kosiaris: url-downloader: Add kube pods in allowed ips [puppet] - 10https://gerrit.wikimedia.org/r/474183 [13:54:49] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/13544/" [puppet] - 10https://gerrit.wikimedia.org/r/474182 (owner: 10Marostegui) [13:55:47] hashar: change merged [13:57:44] (03PS1) 10Addshore: Define a new 'Wikibase' log channel to use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474185 (https://phabricator.wikimedia.org/T207850) [14:02:28] (03PS2) 10Alexandros Kosiaris: url-downloader: Add kube pods in 
allowed ips [puppet] - 10https://gerrit.wikimedia.org/r/474183 [14:04:01] (03CR) 10Ema: [C: 032] Add 0009-verify-config-segfault.patch [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474151 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [14:04:09] (03PS2) 10Ema: trafficserver (8.0.0-1wm2) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474156 (https://phabricator.wikimedia.org/T204209) [14:04:49] (03CR) 10Gehel: elasticsearch: allow filtering instances to be deployed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [14:05:31] (03CR) 10Alexandros Kosiaris: [C: 032] url-downloader: Add kube pods in allowed ips [puppet] - 10https://gerrit.wikimedia.org/r/474183 (owner: 10Alexandros Kosiaris) [14:05:42] (03CR) 10jerkins-bot: [V: 04-1] trafficserver (8.0.0-1wm2) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474156 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [14:06:43] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [14:08:09] (03PS8) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [14:08:40] (03CR) 10Ema: "recheck" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474156 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [14:10:27] (03CR) 10DCausse: [C: 031] "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) (owner: 10Gehel) [14:10:46] (03PS9) 10Gehel: elasticsearch: allow filtering instances to be deployed [puppet] - 10https://gerrit.wikimedia.org/r/473216 (https://phabricator.wikimedia.org/T207918) [14:12:09] (03CR) 10Gehel: [C: 04-1] maps: update SQL script location for kartotherian (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/473736 
(https://phabricator.wikimedia.org/T209566) (owner: 10Mathew.onipe) [14:14:24] (03CR) 10Gehel: [C: 031] "LGTM, waiting for a review from mholloway / msantos since this will affect the instances running on WMCS" [puppet] - 10https://gerrit.wikimedia.org/r/473731 (https://phabricator.wikimedia.org/T209570) (owner: 10Mathew.onipe) [14:14:55] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Hydriz) The WikimediaIncubator extension is working as intended, as it is supposed to direct users to the main page of the wiki. I think a rewrite rule in `... [14:19:06] (03PS1) 10Muehlenhoff: Configure Kerberos support for Druid (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474188 [14:19:39] (03CR) 10jerkins-bot: [V: 04-1] Configure Kerberos support for Druid (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474188 (owner: 10Muehlenhoff) [14:21:28] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [14:21:45] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: trafficserver debian-glue builds failing on integration-slave-jessie-1001: No space left on device - https://phabricator.wikimedia.org/T209703 (10ema) [14:22:01] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: trafficserver debian-glue builds failing on integration-slave-jessie-1001: No space left on device - https://phabricator.wikimedia.org/T209703 (10ema) p:05Triage>03Normal [14:22:24] (03CR) 10Ema: [C: 032] trafficserver (8.0.0-1wm2) stretch-wikimedia; urgency=medium [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/474156 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema) [14:23:26] (03PS2) 10Marostegui: mariadb: Reimage db1118 as mariadb [puppet] - 10https://gerrit.wikimedia.org/r/474182 [14:25:11] (03CR) 10Marostegui: [C: 032] mariadb: 
Reimage db1118 as mariadb [puppet] - 10https://gerrit.wikimedia.org/r/474182 (owner: 10Marostegui) [14:26:28] <_joe_> !log repooling the mw canaries [14:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:28] (03PS2) 10Muehlenhoff: Configure Kerberos support for Druid (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474188 [14:28:51] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw126[1-6].*,dc=eqiad,cluster=appserver [14:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:04] (03CR) 10jerkins-bot: [V: 04-1] Configure Kerberos support for Druid (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474188 (owner: 10Muehlenhoff) [14:29:30] <_joe_> !log re-depooling mw1261 for php-fpm testing [14:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:04] godog: perfect. I am going to restart Zuul soonish [14:33:16] hashar: kk, let me know if I can help [14:33:58] 10Operations, 10decommission, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, and 2 others: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10Dzahn) >>! In T209642#4753118, @aborrero wrote: > Do we have info on specs and expiration dates for the HW?... [14:36:08] !log Gracefully stopping zuul (kill -SIGUSR1) [14:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:26] [contint1001.wikimedia.org] out: => Building image docker-registry.discovery.wmnet/releng/java8:0.4.2 [14:38:26] [contint1001.wikimedia.org] out: ERROR: image docker-registry.discovery.wmnet/releng/java8 failed to build, see logs for details [14:38:27] ... 
[14:39:16] !log trafficserver 8.0.0-1wm2 uploaded to stretch-wikimedia T204225 T204209 [14:39:19] 10Operations, 10decommission, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, and 2 others: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10faidon) This specific HW is /very/ old and is already overdue for decomissioning (by 3 years no less). But m... [14:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:21] T204225: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 [14:39:21] T204209: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 [14:40:20] gpg: failed to start agent '/usr/bin/gpg-agent': No such file or directory [14:40:21] gpg: can't connect to the agent: No such file or directory [14:40:21] gpg: Total number processed: 63 [14:40:22] it is broken :( [14:41:02] 10Operations, 10Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) Results for more endpoints: * PHP7 outperforms HHVM significantly for requests that involve `/w/static.php` (so most static files we serve), but while the relativ... [14:42:00] (03CR) 10MSantos: [C: 031] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/473731 (https://phabricator.wikimedia.org/T209570) (owner: 10Mathew.onipe) [14:42:11] something must have changed in the gpg debian pakcage ;( [14:43:41] sorry wrong chan again [14:47:51] (03PS3) 10Muehlenhoff: Configure Kerberos support for Druid (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474188 [14:50:19] (03PS9) 10Elukey: Introduce new security directives for Yarn/HDFS/MapReduce/Hive/Oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 [14:52:04] CI is on hold waiting for Zuul to restart [14:56:20] !log upgrade cp-ats to 8.0.0-1wm2 T204225 T204209 [14:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:24] T204225: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 [14:56:25] T204209: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 [14:56:28] !log Create ipblocks_restrictions on labswiki and labtestwiki on db1073 - T209674 [14:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:51] (03PS10) 10Elukey: Introduce new security directives for Yarn/HDFS/MapReduce/Hive/Oozie [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 [15:01:15] !log restarting zuul with 1300 events to process [15:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:21] godog: I hvae restarted Zuul [15:01:55] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13548/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/474113 (owner: 10Elukey) [15:02:15] hashar: neat, I'll double check metrics are still making it as they should [15:02:39] (03CR) 10jerkins-bot: [V: 04-1] Configure Kerberos support for Druid (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/474188 (owner: 10Muehlenhoff) [15:05:21] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Apply -R 200 to all the memcached mw object cache 
instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) mc1019 recovered nicely, and I can confirm from https://grafana.wikimed... [15:06:33] godog: I think it works. The Zuul status page has some graphite graphs which seems to have been updated [15:06:43] 10Operations, 10MediaWiki-Cache, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 (10elukey) [15:07:13] hashar: yeah confirmed ! [15:07:19] awesome thank you! [15:09:05] 10Operations, 10Gadgets, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), and 4 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Status: we are still waiting for https://gerrit.wikimedia.or... [15:12:20] mutante: thanks for finding a version of my files. They are a little out of date, but I can work with them, and it's way better than having nothing! [15:14:22] (03CR) 10Mholloway: [C: 031] "Seems reasonable to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/473731 (https://phabricator.wikimedia.org/T209570) (owner: 10Mathew.onipe) [15:16:04] 10Operations, 10Traffic: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) [15:17:56] (03PS1) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [15:18:24] (03PS2) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [15:19:34] (03PS3) 10Imarlier: wmf-config: Enable wgMFNoindexPages for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) [15:22:19] (03PS1) 10Elukey: Update an-worker1080's DHCP MAC address (10G interface) [puppet] - 10https://gerrit.wikimedia.org/r/474273 (https://phabricator.wikimedia.org/T207192) [15:23:02] (03CR) 10Elukey: [C: 032] Update an-worker1080's DHCP MAC address (10G interface) [puppet] - 10https://gerrit.wikimedia.org/r/474273 (https://phabricator.wikimedia.org/T207192) (owner: 10Elukey) [15:24:00] 10Operations, 10Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Joe) Last thing to note: - `pm = static` vs `pm = dynamic` didn't really changed much for long-lasting requests; it made smaller requests faster though, so it's a net w... 
[15:24:18] 10Operations, 10Citoid, 10Patch-For-Review, 10Service-deployment-requests, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10mobrovac) [15:24:22] 10Operations, 10ops-codfw, 10Services (watching): rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10mobrovac) [15:26:01] (03PS3) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [15:27:45] (03CR) 10Vgutierrez: "pcc shows noop in lvs2006 and trimmed interface names in lvs2010: https://puppet-compiler.wmflabs.org/compiler1002/13551/" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [15:28:00] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10hashar) [15:28:27] (03CR) 10Imarlier: "Tagging a bunch of different folks as reviewers, since there isn't an obvious owner of this -- just looking for a sanity check." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [15:28:29] 10Operations, 10Graphite, 10Patch-For-Review, 10Performance-Team (Radar), 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997 (10hashar) Zuul now emits stats to localhost which has statsite running :) [15:29:19] godog: zuul statsd confirmed to work. For Nodepool, I have stopped the service definitely yesterday :) [15:29:49] hashar: \o/ \o/ neat [15:31:17] 10Operations, 10monitoring, 10Patch-For-Review: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10hashar) The `gerrit.` metrics are actually reported by the Zuul service on contint1001. It corresponds to events receive from Gerrit such as patchsets,... 
[15:32:23] 10Operations, 10monitoring, 10Patch-For-Review: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10hashar) [15:36:22] (03PS4) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [15:36:38] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10elukey) So I changed the MAC address of an-worker1080 to the 10G interface listed in the System Setup as "connected", and forced a PX... [15:38:31] (03CR) 10Vgutierrez: "again, NOOP in lvs2006 and expected changes in lvs2010: https://puppet-compiler.wmflabs.org/compiler1002/13552/" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [15:39:00] (03PS1) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [15:40:12] (03CR) 10jerkins-bot: [V: 04-1] toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [15:43:45] (03PS2) 10Dzahn: Partman: Add new ms-be systems [puppet] - 10https://gerrit.wikimedia.org/r/473810 (https://phabricator.wikimedia.org/T209395) (owner: 10Papaul) [15:43:46] 10Operations, 10Performance-Team: Evaluate scalability and performance of PHP7 compared to HHVM - https://phabricator.wikimedia.org/T206341 (10Imarlier) @Joe Might be interesting to look at specific calls that appear to perform less well, to see if we can identify specific calls that are slower. xhprof/tidewa... [15:45:25] Urbanecm: yea, i saw the tickets about zh-yue WP waiting to be renamed to just yue. so all seems right, thanks [15:45:39] Trey314159: ok, cool! glad that helped somewhat. 
i think i can still get more later [15:46:50] mutante: if you do find more later that could be helpful, so let me know—but I've started cleaning up what's there, so please don't overwrite anything! [15:46:59] !log rebooting debmonitor* instances for kernel security update and to pick up SSBD [15:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:42] (03CR) 10Dzahn: [C: 032] Partman: Add new ms-be systems [puppet] - 10https://gerrit.wikimedia.org/r/473810 (https://phabricator.wikimedia.org/T209395) (owner: 10Papaul) [15:48:57] Trey314159: ok! [15:52:47] jerkins, come on [15:55:40] no jenkins vote.. i wonder if that's related to the switch away from nodepool [15:55:58] (03PS1) 10Muehlenhoff: Remove Diamond from Swift backends [puppet] - 10https://gerrit.wikimedia.org/r/474280 (https://phabricator.wikimedia.org/T183454) [15:56:39] mutante: it's backlogged after earlier maintenance I think [15:56:50] Antoine mentioned that there's 1300 backlog or so [15:57:07] yeah, see SAL "restarting zuul with 1300 events to process" [15:57:29] moritzm: ACK, ok, thanks [15:58:21] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10elukey) I also tried to disable explicitly the integrated nic's boot option (and allow only the one from the 10G NIC) but didn't work... [15:58:39] 10Operations, 10Traffic, 10Wikimedia-Incident: Add maint-announce@ to Equinix's recipient list for eqsin incidents - https://phabricator.wikimedia.org/T207140 (10RobH) > Hi Rob > > > > I’m raising to our IT to check on this. > > Please bear with us while they investigate on it. > > > > Best regards... [15:59:50] (03CR) 10Dzahn: [V: 032 C: 032] "already had Verified +2 earlier" [puppet] - 10https://gerrit.wikimedia.org/r/473810 (https://phabricator.wikimedia.org/T209395) (owner: 10Papaul) [16:01:52] Jenkins is now escaping from WMF work and take a vacation? 
[16:02:20] rxy: its put in a lot of hours. it deserves it [16:03:20] (03CR) 10Dzahn: [C: 032] DHCP: ADD MAC address entries for ms-be204[4-9] and ms-be2050 [puppet] - 10https://gerrit.wikimedia.org/r/473817 (https://phabricator.wikimedia.org/T209395) (owner: 10Papaul) [16:03:27] (03PS2) 10Dzahn: DHCP: ADD MAC address entries for ms-be204[4-9] and ms-be2050 [puppet] - 10https://gerrit.wikimedia.org/r/473817 (https://phabricator.wikimedia.org/T209395) (owner: 10Papaul) [16:03:51] (03PS1) 10Ema: Revert "ATS: temporarily avoid calling 'verify_config' in ExecReload" [puppet] - 10https://gerrit.wikimedia.org/r/474283 [16:03:58] (03PS2) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [16:06:39] (03PS3) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [16:07:01] (03CR) 10Cwhite: [C: 031] Remove Diamond from Swift backends [puppet] - 10https://gerrit.wikimedia.org/r/474280 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:10:39] (03PS3) 10Dzahn: DNS: Add production and mgmt DNS entries for ms-be200[4-9] and ms-be2050 [dns] - 10https://gerrit.wikimedia.org/r/473646 (https://phabricator.wikimedia.org/T209395) (owner: 10Papaul) [16:13:22] (03PS2) 10Cwhite: diamond: remove diamond::collector::nginx [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) [16:13:31] (03PS1) 10Gehel: [WIP] wdqs: run test queries periodically on wdqs test servers [puppet] - 10https://gerrit.wikimedia.org/r/474285 (https://phabricator.wikimedia.org/T207665) [16:14:42] bawolff: Jenkins having many tasks. and now my patch is passed a tests. Thanks to Jenkins! 
[16:14:45] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wdqs: run test queries periodically on wdqs test servers [puppet] - 10https://gerrit.wikimedia.org/r/474285 (https://phabricator.wikimedia.org/T207665) (owner: 10Gehel) [16:14:55] (03CR) 10Gehel: "This still needs to extract the list of email addresses from somewhere." [puppet] - 10https://gerrit.wikimedia.org/r/474285 (https://phabricator.wikimedia.org/T207665) (owner: 10Gehel) [16:15:21] (03CR) 10Dzahn: [C: 032] DNS: Add production and mgmt DNS entries for ms-be200[4-9] and ms-be2050 [dns] - 10https://gerrit.wikimedia.org/r/473646 (https://phabricator.wikimedia.org/T209395) (owner: 10Papaul) [16:19:29] (03PS3) 10Cwhite: diamond: remove diamond::collector::nginx [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) [16:20:13] (03PS4) 10Cwhite: diamond: remove diamond::collector::nginx [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) [16:20:41] (03CR) 10Dzahn: "you could create a new mail alias, like wdqs@wikimedia.org in /srv/private/modules/privateexim/files/wikimedia.org on the puppetmaster . 
t" [puppet] - 10https://gerrit.wikimedia.org/r/474285 (https://phabricator.wikimedia.org/T207665) (owner: 10Gehel) [16:22:11] papaul: ready to go ahead with new ms-be installs now [16:22:20] already ran puppet on install2002 [16:22:24] (03CR) 10Gehel: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/474285 (https://phabricator.wikimedia.org/T207665) (owner: 10Gehel) [16:23:17] !log reindexing Chinese wikis on elastic@eqiad and elastic@codfw (T209156) [16:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:21] T209156: Re-index Chinese Wikis - https://phabricator.wikimedia.org/T209156 [16:25:41] (03PS5) 10Cwhite: diamond: remove diamond::collector::nginx [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) [16:26:00] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) Does the community want that, if there is a community of users on the incubator? [16:26:28] (03PS3) 10MSantos: maps: added use_proxy flag to set proxy [puppet] - 10https://gerrit.wikimedia.org/r/473731 (https://phabricator.wikimedia.org/T209570) (owner: 10Mathew.onipe) [16:26:30] (03PS3) 10MSantos: maps: update SQL script location for kartotherian [puppet] - 10https://gerrit.wikimedia.org/r/473736 (https://phabricator.wikimedia.org/T209566) (owner: 10Mathew.onipe) [16:31:13] (03PS1) 10Ema: ATS: add check_trafficserver_verify_config [puppet] - 10https://gerrit.wikimedia.org/r/474288 (https://phabricator.wikimedia.org/T204209) [16:32:31] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Dzahn) Yes, renaming of "zh-yue" to "yue" is stalled / lowest since 12 years or more. (2006 before Bugzilla?) 
T10217 and T30441 [16:32:34] 10Operations, 10Traffic: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) [16:32:37] 10Operations, 10Traffic, 10Patch-For-Review: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (10ema) 05stalled>03Open We fixed the `verify_config` issue in ATS 8.0.0-1wm2, this is not stalled anymore. [16:35:19] (03CR) 10Jdlrobson: [C: 031] "It doesn't look like this interferes with the a/b test being run by reading web so ill be happy to see the results of this experiment!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [16:36:39] (03CR) 10Jdlrobson: [C: 031] wmf-config: Enable wgMFNoindexPages for 6 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [16:36:41] (03PS1) 10Paladox: Phabricator: Fix profile so that it uses hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/474291 [16:36:50] (03CR) 10Filippo Giunchedi: [C: 031] Remove Diamond from Swift backends [puppet] - 10https://gerrit.wikimedia.org/r/474280 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [16:39:34] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by volans on cumin2001.codfw.wmnet for hosts: ` rdb2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/201811161636_v... 
[16:42:22] (03PS2) 10Paladox: Phabricator: Fix profile so that it uses hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/474291 [16:42:24] (03PS4) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [16:42:54] 10Operations, 10Gerrit: Gerrit is down "502 Proxy Error" - https://phabricator.wikimedia.org/T209456 (10thcipriani) 05Open>03Resolved a:03thcipriani Two days later -- after adding in some better monitoring and blocking bots from indexing git blame for giant files with a lot of history -- we seem to be in... [16:47:27] (03PS5) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [16:52:51] (03PS6) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [16:53:11] (03PS7) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [17:00:26] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) @fgiunchedi I did the install on the first system ms-be2044 please check the output below. If it looks good for you let me know so i can resume... [17:00:52] (03CR) 10Bstorm: "Mostly some questions I have:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:02:44] (03CR) 10Bstorm: "Hmm. On the topic of making clush target part of toolforge base, maybe not? Is the grid master currently a clush target, for instance? " [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:03:44] (03CR) 10Bstorm: "Ooh! I've got it! Include the clush::target profile in select other roles instead of making it its own role! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:09:24] (03CR) 10Cwhite: [C: 032] diamond: remove diamond::collector::nginx [puppet] - 10https://gerrit.wikimedia.org/r/473302 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [17:10:33] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1079.eqiad.wmnet'... [17:10:58] (03CR) 10GTirloni: toolforge: Refactor clush (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:11:28] (03PS8) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [17:17:23] PROBLEM - Host lvs2010 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:45] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:59] (03CR) 10Bstorm: "I'm not worried if the clushmaster can control itself, but is the gridmaster and k8s master for instance currently under clush control?" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:21:51] RECOVERY - Host lvs2010 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [17:23:53] PROBLEM - Host lvs2010 is DOWN: PING CRITICAL - Packet loss = 100% [17:24:55] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [17:25:02] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10elukey) Chris solved the mistery! ` │ 17:22 ... 
[17:25:20] (03CR) 10Bstorm: "It's here: https://wikitech.wikimedia.org/wiki/Hiera:Tools" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:25:27] (03PS9) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [17:25:59] RECOVERY - PyBal BGP sessions are established on lvs2010 is OK: NaN https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=codfw%2520prometheus%252Fops [17:26:23] (03CR) 10GTirloni: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:29:11] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:27] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.26 ms [17:34:58] Hello, I think I found a possible hang in the abuse log search feature. The link is https://en.wikipedia.org/wiki/Special:AbuseLog?wpSearchUser=&wpSearchPeriodStart=&wpSearchPeriodEnd=&wpSearchTitle=&wpSearchImpact=0&wpSearchAction=any&wpSearchActionTaken=rangeblock&wpSearchFilter= [17:35:36] Hey all - anyone from Ops around that manages the MailMan installs? :) [17:35:43] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10cloud-services-team (Kanban): Phase out Nodepool from production - https://phabricator.wikimedia.org/T209361 (10hashar) [17:36:40] (03CR) 10Bstorm: toolforge: Refactor clush (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:36:56] varnent: I don't think there's one person and probably not one person who would want to admit it/claim it :) Maybe just ask and see what response you get? (Or file a task, of course: https://phabricator.wikimedia.org/project/view/190/ ) [17:37:38] Can anyone click on this and see if it times out for them too? 
https://en.wikipedia.org/wiki/Special:AbuseLog?wpSearchUser=&wpSearchPeriodStart=&wpSearchPeriodEnd=&wpSearchTitle=&wpSearchImpact=0&wpSearchAction=any&wpSearchActionTaken=rangeblock&wpSearchFilter= [17:37:45] Essentially - we want to archive/shut down WMFall per C-Lvl request - but not sure who best point person for that would be. Trust & Safety said they knew how - but I do not want to step on anyone's toes. :) [17:37:59] varnent: file a task :) [17:38:05] generally that's how work happens :P [17:38:08] varnent, the request should come from OIT [17:38:14] Wmfall list run by techsupport at wikimedia.org [17:38:29] Krenair: Turns out OIT does not do it as they asked me to do this. :) [17:38:39] and I was asked to do it off-Phab [17:38:44] OIT administrate that list [17:38:49] 10Operations, 10decommission, 10User-jijiki: Reclaim rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2001.codfw.wmnet'] ` and were **ALL** successful. [17:39:08] Krenair: You would have to bring this up with Eliza. That is who asked me to look into closing the list. [17:39:14] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10RobH) I've updated the firmware for bios/idrac/network on lvs2009 & lvs2010. lvs2007 & lvs2008 don't respond to mgmt interface connection attempts, and do not ping. Sho... 
[17:39:33] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:39:51] It doesn't sound like there's any owner then per se - so anyone with admin access to the list that does so is fine basically [17:39:53] varnent, list closure requests should surely come from the list's administrators [17:39:59] (03PS4) 10Imarlier: wmf-config: Enable wgMFNoindexPages for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) [17:41:11] Krenair: The request came from C-Lvls - so we're just trying to implement and I've been asked to make sure it happens today - so... [17:41:47] but if you're neither a list administrator nor a mailman administrator how can you ensure it happens today? [17:42:22] Krenair: I ask myself questions like that a lot - but it does not seem to stop me getting such requests from C-Lvl :) [17:42:29] (03CR) 10Imarlier: wmf-config: Enable wgMFNoindexPages for 6 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [17:43:01] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 (10RobH) I don't want to upload Dells firmware drivers to our systems (because I'm sure that is against some user agreements downloading the Dell software!) So I'll just li... [17:43:23] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1079.eqiad.wmnet'] ` and were **ALL** successful. [17:43:39] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:46:32] (03CR) 10GTirloni: toolforge: Refactor clush (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:50:04] (03CR) 10Bstorm: "> Patch Set 9:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) (owner: 10GTirloni) [17:51:14] I see from https://meta.wikimedia.org/wiki/Etherpad that it's "not suitable for long-term storage", but am curious to know if that's still the case? Is there a disk backup? Is this language just to set expectations around SLA? [17:52:45] I would be cautious and assume it is still the case [17:53:09] Really stuff that's long-term should be coming out of etherpad where it can be easily lost and put in wikis where it is discoverable [17:53:59] to work on some notes as a group it's great [17:54:46] +1 it makes sense, we're just calibrating our paranoia [17:55:57] I vaguely recall there have been cases of it loosing stuff before, I don't remember the details [17:56:51] https://wikitech.wikimedia.org/wiki/Incident_documentation/20160623-etherpad [17:56:57] this talks about some lost stuff [17:57:32] awight, ^ [17:58:32] (03PS10) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [17:59:12] Krenair: thanks, that's a helpful data point. 
Unfortunately, a conclusion drawn from that case would be, that we should copy out of etherpad *immediately* after writing, but after a day or two we can consider the data safely preserved 8-) [18:00:35] haha [18:00:42] (03CR) 10Jdlrobson: [C: 031] wmf-config: Enable wgMFNoindexPages for 6 wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [18:01:05] that is hard to refute [18:03:23] personally I'd avoid trusting it with stuff that isn't easily reproducible [18:04:06] and I really don't like that we can't just get a list of all etherpads out there [18:04:27] anything important enough to be kept long-term in there should probably be stored on a wiki page [18:06:51] (03PS11) 10GTirloni: toolforge: Refactor clush [puppet] - 10https://gerrit.wikimedia.org/r/474277 (https://phabricator.wikimedia.org/T209701) [18:11:26] (03PS2) 10Ladsgroup: ores: Move all of celery configs to puppet [puppet] - 10https://gerrit.wikimedia.org/r/474158 (https://phabricator.wikimedia.org/T209587) [18:14:41] awight, judging from the history of the page, akosiaris or mutante might know more about this service [18:17:29] Krenair: > we can't just get a list of all etherpads -- the funny thing is, apparently attackers can get that list but regular users cannot.
[18:19:01] I too would like to know more about that one :P [18:19:22] it might just be there to prevent people treating the unlisted nature of them as truly private [18:21:16] (03CR) 10Ladsgroup: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/13555/ores1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/474158 (https://phabricator.wikimedia.org/T209587) (owner: 10Ladsgroup) [18:34:32] (03PS1) 10Cwhite: nginx: remove diamond::collector::nginx reference [puppet/nginx] - 10https://gerrit.wikimedia.org/r/474309 (https://phabricator.wikimedia.org/T183454) [18:35:41] (03CR) 10Cwhite: "One thing I'm not sure of is how to tie this back to the main puppet repo because it is a submodule." [puppet/nginx] - 10https://gerrit.wikimedia.org/r/474309 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [18:41:05] (03PS3) 10Paladox: Phabricator: Make manifest user and pass configuable through hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 [18:41:12] (03PS4) 10Paladox: Phabricator: Make manifest user and pass configuable through hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 [18:44:09] (03PS5) 10Paladox: Phabricator: Make manifest user and pass configuable through hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 [18:44:14] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [18:45:45] (03PS6) 10Paladox: Phabricator: Make manifest user and pass configuable through hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 [18:45:49] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [18:45:59] (03PS1) 10BryanDavis: toolforge: purge jmail script [puppet] - 10https://gerrit.wikimedia.org/r/474311 (https://phabricator.wikimedia.org/T208579) [18:46:38] (03CR) 10jerkins-bot: [V: 04-1] toolforge: purge jmail script [puppet] - 10https://gerrit.wikimedia.org/r/474311 (https://phabricator.wikimedia.org/T208579) (owner: 
10BryanDavis) [18:49:09] (03CR) 10Anomie: [C: 031] "Seems sane. Haven't tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473889 (https://phabricator.wikimedia.org/T206497) (owner: 10Imarlier) [18:49:45] (03PS7) 10Dzahn: Phabricator: Make manifest user and pass configuable through hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [18:50:01] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [18:50:06] (03CR) 10Dzahn: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [18:54:24] (03CR) 10Dzahn: [C: 031] "the story here is: since recently we are getting emails again for cloud VPS project owners when there is cron spam. that made us notice th" [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [19:06:05] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [19:06:47] (03PS2) 10BryanDavis: toolforge: purge jmail script [puppet] - 10https://gerrit.wikimedia.org/r/474311 (https://phabricator.wikimedia.org/T208579) [19:06:57] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [19:09:54] (03PS1) 10Rush: AlarmCounterLogster: add ability to check ip in range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) [19:10:19] (03PS2) 10Rush: AlarmCounterLogster: add ability to check ip in range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) [19:11:39] (03CR) 10jerkins-bot: [V: 04-1] AlarmCounterLogster: add ability to check ip in range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [19:15:00] (03PS3) 10Rush: AlarmCounterLogster: add ability to check ip in 
range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) [19:17:06] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [19:19:44] (03PS4) 10Rush: AlarmCounterLogster: add ability to check ip in range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) [19:20:33] (03CR) 10jerkins-bot: [V: 04-1] AlarmCounterLogster: add ability to check ip in range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [19:21:34] (03PS5) 10Rush: AlarmCounterLogster: add ability to check ip in range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) [19:24:59] (03CR) 10Brian Wolff: [C: 031] "I dont speak python, but lgtm afaict" [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [19:26:24] (03CR) 10Rush: [C: 032] AlarmCounterLogster: add ability to check ip in range [puppet] - 10https://gerrit.wikimedia.org/r/474313 (https://phabricator.wikimedia.org/T208611) (owner: 10Rush) [19:35:47] (03PS1) 10Herron: kafka_shipper: use mmrm1stspace to remove leading space in msg field [puppet] - 10https://gerrit.wikimedia.org/r/474317 (https://phabricator.wikimedia.org/T206454) [19:40:03] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [19:40:44] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [19:40:45] (03PS1) 10Herron: kafka_shipper: update syslog json template [puppet] - 10https://gerrit.wikimedia.org/r/474319 (https://phabricator.wikimedia.org/T206454) [19:45:22] 10Operations, 10Wikimedia-Logstash, 10Core 
Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Setup Kafka cluster, producers and consumers for logging pipeline - https://phabricator.wikimedia.org/T206454 (10herron) [19:50:38] PROBLEM - Check size of conntrack table on ms-be2045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.106: Connection reset by peer [19:50:39] PROBLEM - swift-container-updater on ms-be2045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.106: Connection reset by peer [19:51:16] ^ that would be servers that are installed and brandnew [19:51:19] and not in prod yet [19:51:26] * mutante double checks the number [19:52:05] yea, T209395 [19:52:05] T209395: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 [19:52:28] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.106: Connection reset by peer [19:52:28] PROBLEM - swift-object-auditor on ms-be2045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.106: Connection reset by peer [19:53:01] (03PS1) 10Herron: kafka_shipper: add apache2 to lookup table with kafka output [puppet] - 10https://gerrit.wikimedia.org/r/474320 (https://phabricator.wikimedia.org/T205852) [19:54:09] (03CR) 10Herron: "this is safe -- the profile using this lookup table is deployed only to one host currently and further rollout will be controlled and grad" [puppet] - 10https://gerrit.wikimedia.org/r/474320 (https://phabricator.wikimedia.org/T205852) (owner: 10Herron) [19:54:20] PROBLEM - configured eth on ms-be2045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.106: Connection reset by peer [19:54:20] PROBLEM - Check the NTP synchronisation status of timesyncd on ms-be2045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.106: Connection reset by peer [19:54:20] PROBLEM - swift-object-replicator on ms-be2045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 
10.192.0.106: Connection reset by peer [19:55:38] RECOVERY - Check size of conntrack table on ms-be2045 is OK: OK: nf_conntrack is 0 % full [19:55:38] RECOVERY - swift-container-updater on ms-be2045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:56:22] RECOVERY - configured eth on ms-be2045 is OK: OK - interfaces up [19:56:22] RECOVERY - swift-object-replicator on ms-be2045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:56:28] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational [19:56:28] RECOVERY - swift-object-auditor on ms-be2045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:57:06] well, i was scheduling downtime for that, but then .. ok [20:08:23] (03CR) 10Dzahn: [C: 032] "noop for prod but fixes cloud and cron spam https://puppet-compiler.wmflabs.org/compiler1002/26/phab1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [20:08:40] (03PS8) 10Dzahn: Phabricator: Make manifest user and pass configuable through hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [20:10:40] (03PS9) 10Dzahn: Phabricator: Make manifest user and pass configurable through Hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [20:12:40] (03CR) 10Dzahn: [C: 032] Phabricator: Make manifest user and pass configurable through Hiera [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [20:16:00] (03CR) 10Dzahn: [C: 032] "noop in prod confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/474291 (owner: 10Paladox) [20:23:33] (03PS1) 10Cwhite: role: add aggregations for TCP Fast Open to prometheus global [puppet] - 10https://gerrit.wikimedia.org/r/474321 (https://phabricator.wikimedia.org/T183454) [20:24:10] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 0 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops [20:24:22] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2045 is OK: OK: synced at Fri 2018-11-16 20:24:21 UTC. [20:27:08] (03PS2) 10Dzahn: stdlib: import useful data types (filemode,filesource,fqdn,host,port) [puppet] - 10https://gerrit.wikimedia.org/r/472363 [20:28:50] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Urbanecm) >>! In T209693#4754087, @ArielGlenn wrote: > Does the community want that, if there is a community of users on the incubator? I think explicit bl... [20:29:54] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10ArielGlenn) Are there other zh-yue and yue projects that also need to be addressed? If we are going to add redirects, we might as well do all that are needed. [20:30:19] (03CR) 10Dzahn: [C: 031] icinga: manage permissions for replicated files [puppet] - 10https://gerrit.wikimedia.org/r/473789 (https://phabricator.wikimedia.org/T208824) (owner: 10Cwhite) [20:31:42] (03PS4) 10Cwhite: icinga: manage permissions for replicated files [puppet] - 10https://gerrit.wikimedia.org/r/473789 (https://phabricator.wikimedia.org/T208824) [20:32:23] (03CR) 10Cwhite: [C: 032] icinga: manage permissions for replicated files [puppet] - 10https://gerrit.wikimedia.org/r/473789 (https://phabricator.wikimedia.org/T208824) (owner: 10Cwhite) [20:35:14] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:35:22] (03PS1) 10Sbisson: Labs: setting privacy statement url for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474325 (https://phabricator.wikimedia.org/T209725) [20:36:50] (03PS5) 10Niedzielski: Doc: add repoConceptBaseUri comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) [20:38:55] (03PS6) 10Niedzielski: Doc: add repoConceptBaseUri comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) [20:39:25] (03CR) 10Niedzielski: "Thanks, Lucas. Revised." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473292 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [20:39:35] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list - https://phabricator.wikimedia.org/T209726 (10Beeblebrox) [20:41:48] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [20:42:00] (03CR) 10Niedzielski: "I'm a little bit at a loss for whether HTTP or HTTPS is preferable so I'm leaving it as is. 
On one hand, I hear your point about stability" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473835 (https://phabricator.wikimedia.org/T209352) (owner: 10Niedzielski) [20:43:11] (03PS1) 10Sbisson: Enable and configure Welcome survey on kowiki and cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474331 (https://phabricator.wikimedia.org/T209725) [20:47:39] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Dzahn) Wikipedia: zh-classical zh-min-nan zh-yue Wiktionary: [20:49:04] (03PS2) 10Cwhite: ntp: ensure absent ntpd diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) [20:50:22] 10Operations, 10Wikimedia-Mailing-lists: Need to shut down a list - https://phabricator.wikimedia.org/T209726 (10Bawolff) [You should probably include the name of the mailing list in the task title] [20:51:34] bawolff: you can edit it :) [20:52:01] Yeah but i dont know the name of the list [20:53:46] (03CR) 10Cwhite: [C: 032] ntp: ensure absent ntpd diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/473295 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [20:56:03] 10Operations, 10Wikimedia-Apache-configuration: Redirect from zh-yue.wiktionary.org is not working properly - https://phabricator.wikimedia.org/T209693 (10Dzahn) re: community desire: T30441#2637271 -> "Many editors of Cantonese Wikipedia have been watching this thread for 9 years" [20:56:05] (03CR) 10Catrope: [C: 032] Labs: setting privacy statement url for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474325 (https://phabricator.wikimedia.org/T209725) (owner: 10Sbisson) [20:58:15] bawolff: oh, I thought I read it in the description the first time I quickly skimmed, guess not :) [20:58:43] (03Merged) 10jenkins-bot: Labs: setting privacy statement url for kowiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/474325 (https://phabricator.wikimedia.org/T209725) (owner: 10Sbisson) [20:59:27] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10thcipriani) @RobH would you know if it's possible to add physical storage to this machine? If not we'll have to work... [21:02:20] (03CR) 10jenkins-bot: Labs: setting privacy statement url for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474325 (https://phabricator.wikimedia.org/T209725) (owner: 10Sbisson) [21:02:45] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10RobH) contint1001 was purchased on T130738, and has dual 1TB SATA disks. We generally don't store anything in the /... [21:03:01] (03PS1) 10Dzahn: update puppet stdlib to upstream release branch [puppet] - 10https://gerrit.wikimedia.org/r/474334 [21:04:15] (03CR) 10jerkins-bot: [V: 04-1] update puppet stdlib to upstream release branch [puppet] - 10https://gerrit.wikimedia.org/r/474334 (owner: 10Dzahn) [21:05:37] (03CR) 10Dzahn: "i understand the concerns to cherry pick some things, here is what happens if i update to the upstream release branch: https://gerrit.wiki" [puppet] - 10https://gerrit.wikimedia.org/r/472363 (owner: 10Dzahn) [21:07:08] (03CR) 10Dzahn: "originally i just wanted to import some more useful data types, but after the comments on https://gerrit.wikimedia.org/r/#/c/operations/pu" [puppet] - 10https://gerrit.wikimedia.org/r/474334 (owner: 10Dzahn) [21:09:42] (03CR) 10Dzahn: "is the release branch to new to use with our puppet 4.8.2 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/474334 (owner: 10Dzahn) [21:10:25] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install new ms-be servers ms-be204[4-9] ,ms-be2050 - https://phabricator.wikimedia.org/T209395 (10Papaul) [21:20:39] PROBLEM - swift-account-server on ms-be2046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.77: Connection reset by peer [21:22:03] ms-be2046 is me [21:22:31] PROBLEM - swift-container-auditor on ms-be2046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.77: Connection reset by peer [21:24:19] PROBLEM - MD RAID on ms-be2046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.77: Connection reset by peer [21:24:20] PROBLEM - swift-container-replicator on ms-be2046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.77: Connection reset by peer [21:24:24] 10Operations, 10ops-codfw: Degraded RAID on ms-be2046 - https://phabricator.wikimedia.org/T209727 (10ops-monitoring-bot) [21:26:09] PROBLEM - swift-container-server on ms-be2046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.77: Connection reset by peer [21:27:41] RECOVERY - swift-account-server on ms-be2046 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:28:02] (03PS2) 10Dzahn: update puppet stdlib to upstream release branch [puppet] - 10https://gerrit.wikimedia.org/r/474334 [21:28:09] RECOVERY - swift-container-server on ms-be2046 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [21:28:21] RECOVERY - swift-container-replicator on ms-be2046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [21:28:21] RECOVERY - MD RAID on ms-be2046 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [21:28:35] RECOVERY - swift-container-auditor on ms-be2046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:28:46] (03CR) 
10jerkins-bot: [V: 04-1] update puppet stdlib to upstream release branch [puppet] - 10https://gerrit.wikimedia.org/r/474334 (owner: 10Dzahn) [21:33:19] (03CR) 10Dzahn: "the part i care about is just the ./types/ directory" [puppet] - 10https://gerrit.wikimedia.org/r/474334 (owner: 10Dzahn) [21:34:26] (03CR) 10Dzahn: "and not sure what would fix the jenkins -1" [puppet] - 10https://gerrit.wikimedia.org/r/474334 (owner: 10Dzahn) [21:35:10] (03PS2) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [21:35:25] PROBLEM - puppet last run on ms-be2046 is CRITICAL: CRITICAL: Puppet has 24 failures. Last run 7 minutes ago with 24 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdm],Exec[parted-/dev/sdn],Exec[parted-/dev/sdc],Exec[parted-/dev/sdd] [21:35:45] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:38:24] (03CR) 10Paladox: [C: 04-1] phabricator: add data types to all parameters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:38:53] (03CR) 10Dzahn: [C: 04-1] "lol, this manages to cause an exception in puppet-lint itself, that's special bonus!" 
[puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:39:25] (03CR) 10Paladox: [C: 04-1] phabricator: add data types to all parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:45:16] (03PS3) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [21:46:27] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:46:31] (03CR) 10Paladox: [C: 04-1] phabricator: add data types to all parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:47:32] Trey314159: check again on mwmaint1002, a new "restored" dir in your home. that is from Bacula now, from mwmaint1001, there is reindex and also dot files [21:48:29] (03CR) 10Dzahn: [C: 04-1] phabricator: add data types to all parameters (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:49:24] (03CR) 10Dzahn: phabricator: add data types to all parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:49:49] (03PS4) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [21:50:41] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:52:25] (03PS5) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [21:55:56] (03PS6) 10Paladox: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [21:56:01] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [22:00:06] (03PS1) 10Takidelfin: InitaliseSettings: Remove redundant namespace talks definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 
[22:03:55] (03PS2) 10Takidelfin: InitialiseSettings: Remove redundant namespace talks definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 [22:11:42] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/474372 (owner: 10Takidelfin) [22:20:31] (03PS2) 10Andrew Bogott: nodepool: labtestservices2003 is not used for testing [puppet] - 10https://gerrit.wikimedia.org/r/473834 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:35:22] (03CR) 10Andrew Bogott: [C: 032] nodepool: labtestservices2003 is not used for testing [puppet] - 10https://gerrit.wikimedia.org/r/473834 (https://phabricator.wikimedia.org/T209361) (owner: 10Hashar) [22:36:27] (03CR) 10Andrew Bogott: [C: 031] Make labnodepool1001.eqiad.wmnet a spare system [puppet] - 10https://gerrit.wikimedia.org/r/473838 (https://phabricator.wikimedia.org/T209642) (owner: 10Hashar) [22:36:41] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/13558/phab1001.eqiad.wmnet/change.phab1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [22:45:16] (03PS7) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [23:01:20] (03PS1) 10Dzahn: add missing fake passwords for phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/474378 [23:03:27] (03PS2) 10Dzahn: add missing fake passwords for phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/474378 [23:05:13] (03CR) 10Dzahn: [V: 032 C: 032] add missing fake passwords for phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/474378 (owner: 10Dzahn) [23:06:05] (03CR) 10Dzahn: "needed https://gerrit.wikimedia.org/r/#/c/labs/private/+/474378/" [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [23:09:26] (03PS1) 10Dzahn: add missing gerritbot_token variable for phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/474381 [23:09:50] (03CR) 10Dzahn: [V: 032 
C: 032] add missing gerritbot_token variable for phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/474381 (owner: 10Dzahn) [23:12:00] PROBLEM - WDQS HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.002 second response time [23:12:47] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@ee91c41]: Deploy on test wdqs1010 [23:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:08] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:13:10] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@ee91c41]: Deploy on test wdqs1010 (duration: 00m 23s) [23:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:14] RECOVERY - WDQS HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.028 second response time [23:15:24] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [23:17:17] (03PS8) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [23:18:23] (03CR) 10jerkins-bot: [V: 04-1] phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 (owner: 10Dzahn) [23:24:18] (03PS1) 10Dzahn: fix typo in gerritbot_token password variable for phab [labs/private] - 10https://gerrit.wikimedia.org/r/474383 [23:25:06] (03CR) 10Dzahn: [V: 032 C: 032] fix typo in gerritbot_token password variable for phab [labs/private] - 10https://gerrit.wikimedia.org/r/474383 (owner: 10Dzahn) [23:34:26] (03PS9) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [23:36:48] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:40:10] PROBLEM - WDQS HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.004 second response time [23:41:20] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [23:42:29] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10kruusamagi) For me, it seems that the issue has grown even bigger in time. The delay with Estonian Wikipedia is often like 3 weeks (!!!), that means not... [23:43:44] (03PS1) 10Dzahn: add missing metrics_user/metrics_pass for phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/474386 [23:44:39] (03CR) 10Dzahn: [V: 032 C: 032] add missing metrics_user/metrics_pass for phabricator [labs/private] - 10https://gerrit.wikimedia.org/r/474386 (owner: 10Dzahn) [23:45:02] (03PS10) 10Dzahn: phabricator: add data types to all parameters [puppet] - 10https://gerrit.wikimedia.org/r/471325 [23:46:51] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Bawolff) >>! In T119366#4754971, @kruusamagi wrote: > For me, it seems that the issue has grown even bigger in time. The delay with Estonian Wikipedia i... [23:48:04] PROBLEM - WDQS HTTP Port on wdqs1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.002 second response time [23:49:42] 10Operations, 10monitoring, 10Patch-For-Review: rename tegmen to icinga2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T208824 (10Dzahn) edited https://netbox.wikimedia.org/dcim/devices/1705/ and renamed in Netbox and set to status "Active" instead of "Inventory". 
[23:49:55] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Bawolff) Fwiw: im of the opinion that date magic words should reduce varnish cache to at least 24 hours, maybe six hours. Im doubtful that super long ca... [23:52:56] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Concerns about icinga1001 check latency - https://phabricator.wikimedia.org/T208066 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4 has an average latency of 7.153 sec and that is better or the same as before, after our tun... [23:55:12] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [23:58:12] RECOVERY - WDQS HTTP Port on wdqs1010 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.015 second response time [23:58:37] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Dzahn) [23:58:52] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Dzahn) 05Open>03stalled p:05Triage>03Normal [23:58:54] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10Dzahn) [23:59:48] 10Operations, 10Icinga, 10decommission, 10monitoring: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10Dzahn) changed netbox status from Active to Staged https://netbox.wikimedia.org/dcim/devices/1592/