[00:01:28] (03CR) 10Dzahn: [C: 032] admins: add bitpogo and tieu to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/429460 (https://phabricator.wikimedia.org/T191523) (owner: 10Dzahn) [00:21:34] !log increase cluster.routing.allocation.balance.threshold from 1.0 to 1.5 for eqiad elasticsearch cluster to reduce rebalancing aggressiveness [00:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:54] 10Operations, 10media-storage: upload.wikimedia.org should serve a Wikimedia 404 error page when file not found in Swift - https://phabricator.wikimedia.org/T37053#4170142 (10Krinkle) [00:51:08] (03CR) 10Krinkle: wmfusercontent.org: add SPF record to disable email (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [00:51:25] (03CR) 10Krinkle: [C: 031] w.wiki: add SPF record, disallow email [dns] - 10https://gerrit.wikimedia.org/r/429871 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [01:09:28] (03CR) 10Rxy: ">" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [01:29:03] (03CR) 10Rxy: "> >" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [01:29:48] (03CR) 10Rxy: "> > >" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [02:21:31] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4170259 (10Krinkle) [02:22:33] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#3049047 (10Krinkle) [02:47:03] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.1) (duration: 07m 04s) [02:47:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:54] (03PS1) 10Bstorm: wiki replicas: provide backward compatibility for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) [03:27:32] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 713.36 seconds [03:29:33] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 21.28, 22.21, 23.86 [04:16:12] PROBLEM - mediawiki-installation DSH group on mw2173 is CRITICAL: Host mw2173 is not in mediawiki-installation dsh group [04:17:42] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 264.94 seconds [05:53:50] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4170441 (10Marostegui) Thank you @Cmjohnson I have started MySQL to let it catch up and replicate for a day before repooling it. MySQL and kernel have been upgraded too [05:56:08] (03PS1) 10Marostegui: Revert "db1098.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/430032 [05:56:14] (03PS2) 10Marostegui: Revert "db1098.yaml: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/430032 [05:56:37] (03CR) 10Marostegui: "This can be merged once the server is ready to be repooled" [puppet] - 10https://gerrit.wikimedia.org/r/430032 (owner: 10Marostegui) [05:58:49] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4170443 (10Marostegui) This can be merged once we are ready to repool: https://gerrit.wikimedia.org/r/#/c/430032/ [06:29:42] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:31:02] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:31:12] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs] [06:56:02] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:56:12] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:42] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:17:33] (03CR) 10jenkins-bot: Don't try to set wgSiteSupportPage, ignored for a decade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428365 (https://phabricator.wikimedia.org/T192467) (owner: 10Jforrester) [07:17:37] (03CR) 10jenkins-bot: Set SPARQL services to use internal cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429853 (https://phabricator.wikimedia.org/T192942) (owner: 10Smalyshev) [07:17:42] (03CR) 10jenkins-bot: Set $wgKartographerUsePageLanguage to false everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429985 (https://phabricator.wikimedia.org/T192955) (owner: 10Catrope) [07:33:12] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 66.27, 35.22, 27.01 [07:35:13] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 28.20, 30.99, 26.46 [08:02:27] (03CR) 10Gilles: Reafactor varnishlog consumers (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (owner: 10Gilles) [08:33:52] PROBLEM - HHVM rendering on mw2137 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [08:34:42] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 74396 bytes in 0.304 second response time [09:19:22] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.098 second response time [09:30:21] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4170558 (10alanajjar) @Pchelolo @mobrovac @Tgr Would be helpful if we create tracking task for stuck renames? (like T169440) [09:33:42] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4170563 (10mobrovac) >>! In T193254#4169720, @Tgr wrote: > So we should probably get the core bug fixed. I filed {T193471} for this. >>! In T193254#4169741... [09:35:43] (03CR) 10Volans: [C: 04-1] "Thanks for the quick replies, comments inline." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (owner: 10Gilles) [09:37:29] (03Draft2) 10Matěj Suchánek: Update Wikidata property blacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430045 [09:39:22] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.123 second response time [09:52:56] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473#4170599 (10Gehel) [09:56:02] (03PS1) 10Gehel: wdqs: remove PrivateTmp option from wdqs-blazegraph systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/430049 [10:01:17] [Wug5YQpAAEEAACY@lycAAADM] 2018-05-01 09:54:47: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" [10:01:37] To avoid creating high replication lag, this transaction was aborted because the write duration (3.330155134201) exceeded the 3 second limit [10:01:48] just trying to move a page... [10:02:39] ok, now it works?? [10:09:25] (03PS1) 10Mobrovac: Recommendation API: Migrate to the new WDQS internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) [10:11:46] 10Operations, 10Analytics, 10EventBus, 10GlobalRename, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4170642 (10Tgr) >>! In T193254#4170563, @mobrovac wrote: > I filed {T193471} for this. Thanks! > I think we should go ahead and switch these two for all wi... 
[10:16:19] (03CR) 10Mobrovac: "PCC OK: https://puppet-compiler.wmflabs.org/compiler02/11084/" [puppet] - 10https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: 10Mobrovac) [10:20:12] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 62233 MB (12% inode=99%) [10:21:12] RECOVERY - Disk space on elastic1018 is OK: DISK OK [11:43:02] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 45.46, 35.96, 32.10 [12:03:18] (03CR) 10Gilles: Reafactor varnishlog consumers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/429843 (owner: 10Gilles) [12:09:25] 10Operations, 10Performance-Team, 10Traffic: Refactor varnishospital and varnishslowlog - https://phabricator.wikimedia.org/T193489#4170862 (10Gilles) [12:09:39] 10Operations, 10Performance-Team, 10Traffic: Refactor varnishospital and varnishslowlog - https://phabricator.wikimedia.org/T193489#4170874 (10Gilles) p:05Triage>03Normal [12:13:03] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 30.75, 30.26, 32.11 [12:16:02] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 34.20, 31.26, 32.11 [12:24:22] PROBLEM - HHVM rendering on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:25:13] RECOVERY - HHVM rendering on mw2206 is OK: HTTP OK: HTTP/1.1 200 OK - 74470 bytes in 0.306 second response time [12:45:22] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: 10Mobrovac) [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T1300). Please do the needful. [13:00:04] No GERRIT patches in the queue for this window AFAICS. 
[13:02:13] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 38.20, 33.48, 32.04 [13:26:52] PROBLEM - Check that eventlogging-service-eventbus is running on kafka1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [13:28:02] PROBLEM - Check systemd state on kafka1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:28:33] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties [13:29:57] PROBLEM - Kafka Broker Server on kafka1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties [13:30:14] agaIN!? [13:30:25] AHH i did it again [13:30:27] typed the fqdn [13:30:28] sheez [13:30:30] <_joe_> ottomata: you around and taking care of this? [13:30:31] <_joe_> ahahah [13:30:35] for the downtime [13:30:39] <_joe_> you forgot to downtime it [13:30:47] didn't forget [13:30:49] <_joe_> or well [13:30:51] but you can't type the fqdn [13:30:51] yeah [13:30:53] <_joe_> you used the fqdn [13:30:55] yea [13:30:58] <_joe_> it bit me once [13:30:58] did it yesterday too [13:31:05] i even thought about it before i typed it [13:31:08] ok I see it's all under control :D [13:31:09] but then i did a couple of other things [13:31:14] i did it again! [13:31:16] sorry volans|off gahhh! 
[13:31:17] <_joe_> I was downtiming 38 hosts that I was powering off :D [13:31:51] ottomata: no worries lol [13:32:15] (03CR) 10Ottomata: [C: 032] Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - 10https://gerrit.wikimedia.org/r/428390 (owner: 10Fdans) [13:32:20] (03PS9) 10Ottomata: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - 10https://gerrit.wikimedia.org/r/428390 (owner: 10Fdans) [13:32:34] (03CR) 10Ottomata: [V: 032 C: 032] Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - 10https://gerrit.wikimedia.org/r/428390 (owner: 10Fdans) [13:34:33] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 9.00, 13.45, 23.14 [13:35:47] !log restarted hhvm on mw1233 [13:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:40] (03PS1) 10Gehel: elasticsearch: raise alerting limit for free disk space [puppet] - 10https://gerrit.wikimedia.org/r/430066 (https://phabricator.wikimedia.org/T192972) [13:51:22] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.099 second response time [13:53:44] (03PS1) 10Fdans: Small improvements to the geoip archive script [puppet] - 10https://gerrit.wikimedia.org/r/430067 (https://phabricator.wikimedia.org/T136732) [13:56:22] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.081 second response time [13:59:04] (03PS1) 10Anomie: Raise Scribunto maxLangCacheSize to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430068 (https://phabricator.wikimedia.org/T85461) [13:59:36] (03PS1) 10R4q3NWnUx2CEhVyr: Change to use VUT [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/430069 [13:59:50] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [14:04:20] RECOVERY - Check systemd state on kafka1003 
is OK: OK - running: The system is fully operational [14:09:00] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad@0 on kafka1003 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-codfw_to_main-eqiad@0/producer\.properties [14:09:16] RECOVERY - Kafka Broker Server on kafka1003 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties [14:17:22] (03PS1) 10Ottomata: Revert "Temporarly remove partman recipe for kafka main hosts" [puppet] - 10https://gerrit.wikimedia.org/r/430072 [14:17:29] (03PS2) 10Ottomata: Revert "Temporarly remove partman recipe for kafka main hosts" [puppet] - 10https://gerrit.wikimedia.org/r/430072 [14:17:32] (03CR) 10Ottomata: [V: 032 C: 032] Revert "Temporarly remove partman recipe for kafka main hosts" [puppet] - 10https://gerrit.wikimedia.org/r/430072 (owner: 10Ottomata) [14:36:10] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408#4171089 (10RobH) p:05Triage>03Normal I'm attempting to set the priority for all unassigned SRE/Ops tasks under #operations. This appears to be a normal (or possibly high) priority. [14:36:37] 10Operations, 10Wikispeech, 10Wikispeech-WMSE: TTS server deployment strategy - https://phabricator.wikimedia.org/T193072#4171091 (10RobH) p:05Triage>03Normal [14:36:56] 10Operations: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457#4171092 (10RobH) p:05Triage>03High [14:38:03] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394#4171095 (10RobH) p:05Triage>03High This host is under warranty until 2019-03-14, and is an HP DL360. 
[14:42:41] (03PS2) 10Ottomata: Small improvements to the geoip archive script [puppet] - 10https://gerrit.wikimedia.org/r/430067 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [14:53:44] 10Operations, 10Cloud-Services: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4171118 (10chasemp) p:05Triage>03Normal [14:54:00] (03CR) 10Herron: [C: 032] w.wiki: add SPF record, disallow email [dns] - 10https://gerrit.wikimedia.org/r/429871 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [14:55:04] 10Operations, 10Cloud-Services: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4171118 (10chasemp) I have previously talked with @ayounsi about this and promised a task weeks ago :) I did assign this but only bc of that and I know @ayounsi is the human who can hel... [14:56:06] 10Operations, 10Cloud-Services, 10netops: Consider renumbering Labs to separate address spaces - https://phabricator.wikimedia.org/T122406#4171135 (10chasemp) {T193496} is related [14:56:07] (03PS5) 10Herron: icinga-sms: use localhost as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/429344 (https://phabricator.wikimedia.org/T82937) [15:03:32] 10Operations, 10Cloud-Services, 10netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496#4171173 (10mark) [15:18:31] 10Operations, 10Wikimedia-Apache-configuration, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4171224 (10RobH) p:05Triage>03Normal [15:19:17] (03PS1) 10Framawiki: Enable wgCiteResponsiveReferences on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430075 (https://phabricator.wikimedia.org/T193491) [15:22:45] 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, 
resulting in varying output - https://phabricator.wikimedia.org/T193414#4171244 (10Jdforrester-WMF) [15:27:38] (03CR) 10Jforrester: [C: 031] "Happy for this to go out whenever kowiki wants." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430075 (https://phabricator.wikimedia.org/T193491) (owner: 10Framawiki) [15:48:32] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1967 bytes in 0.099 second response time [15:48:43] (03PS3) 10Fdans: Small improvements to the geoip archive script [puppet] - 10https://gerrit.wikimedia.org/r/430067 (https://phabricator.wikimedia.org/T136732) [15:50:12] !log restbase: begin culling leaked revisions, others_T_mobile__ng_lead -- T192689 [15:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:16] T192689: Unchecked storage growth(?) - https://phabricator.wikimedia.org/T192689 [15:53:32] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1945 bytes in 0.118 second response time [15:58:02] 10Operations, 10cloud-services-team, 10monitoring: Prometheus vs. CPU usage vs. hyperthreading - https://phabricator.wikimedia.org/T193272#4171307 (10chasemp) https://www.percona.com/blog/2015/01/15/hyper-threading-double-cpu-throughput/ http://perfdynamics.blogspot.com/2014/01/monitoring-cpu-utilization-un... [16:00:05] godog, moritzm, and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T1600). [16:00:05] eddiegp: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:19] o/ [16:01:03] 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4171313 (10ssastry) >>! In T193414#4171156, @Anomie wrote: > Until we sw... [16:02:09] 10Operations, 10MediaWiki-Parser, 10MediaWiki-Platform-Team, 10Parsing-Team, and 2 others: Different production servers have different versions of tidy installed, resulting in varying output - https://phabricator.wikimedia.org/T193414#4171319 (10ssastry) >>! In T193414#4171220, @Jdforrester-WMF wrote: > Pr... [16:02:17] (03PS1) 10Ottomata: icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 [16:03:05] (03PS2) 10Ottomata: icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 [16:03:15] in the past ~20 min 4 elastic servers have started warning on /srv disk space. expected? [16:06:40] gehel: ebernhardson ^ [16:07:30] herron: yep, expected, I'll ack them again, we're discussing what we should really do in #wikimedia-discovery if you are interested! [16:07:57] cool thanks! [16:08:08] herron: sorry for the noise, and thanks for the ping! [16:10:28] jouncebot: now [16:10:28] For the next 0 hour(s) and 49 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T1600) [16:10:52] Anyone willing to do that? ^ [16:12:33] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services, and 2 others: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4140086 (10Pchelolo) It seems this task got derailed completely from the original purpose. The original issue regarding `ObjectCac... 
[16:15:31] !log manually kicked off mirror@sodium:~$ /usr/local/sbin/update-ubuntu-mirror to clear ubuntu mirror out of sync alert [16:15:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:55] !log T192972 change eqiad elasticsearch disk watermarks from 85/85 to 80/80 to match disk space alerts [16:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:59] T192972: Evaluate impact of adding ~2700 new shards to production cluster - https://phabricator.wikimedia.org/T192972 [16:25:36] Puppet swat - anyone? :) [16:30:11] eddiegp: checking... [16:30:19] (03PS3) 10Gehel: mediawiki: Remove updateArticleCount cron [puppet] - 10https://gerrit.wikimedia.org/r/425968 (https://phabricator.wikimedia.org/T192139) (owner: 10EddieGP) [16:31:59] eddiegp: sounds trivial enough, I'll merge [16:32:05] gehel: Thanks :) [16:32:09] running puppet compiler first, just in case... [16:32:09] (03PS1) 10Alexandros Kosiaris: prometheus: Add buddyinfo collector to node exporter [puppet] - 10https://gerrit.wikimedia.org/r/430082 [16:32:57] Yeah, ... T192532 [16:32:58] T192532: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532 [16:33:12] (03CR) 10Gehel: [C: 032] mediawiki: Remove updateArticleCount cron [puppet] - 10https://gerrit.wikimedia.org/r/425968 (https://phabricator.wikimedia.org/T192139) (owner: 10EddieGP) [16:33:36] eddiegp: yeah as well! that would be nice! 
[16:36:27] eddiegp: still running puppet on terbium / wasat to make sure all is green [16:36:45] (03CR) 10Dzahn: wmfusercontent.org: add SPF record to disable email (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [16:36:47] (03PS2) 10Dzahn: wmfusercontent.org: add SPF record to disable email [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) [16:37:27] eddiegp: and puppet completed with no error [16:37:33] 10Operations, 10Patch-For-Review, 10Wikimedia-maintenance-script-run: Remove monthly run of updateArticleCount.php - https://phabricator.wikimedia.org/T192139#4171430 (10EddieGP) 05Open>03Resolved All cleanup done :-) [16:37:47] and no change either... [16:37:56] gehel: Happy to hear that! [16:38:04] eddiegp: so am I! [16:38:15] 10Operations, 10monitoring, 10User-fgiunchedi: Better organization for ops grafana dashboards - https://phabricator.wikimedia.org/T178690#4171433 (10akosiaris) >>! In T178690#4168673, @Volans wrote: > As discussed in the monitoring meeting here some feedback: > > - while the limit on the number of rows/pane... [16:38:37] (03CR) 10Dzahn: "thanks Krinkle and Rxy: i moved it up to the "Mail exchangers" section." [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [16:46:04] (03CR) 10Anomie: "2 errors, 1 suggestion, each repeated for a few tables." (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [16:46:13] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services, and 2 others: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4171459 (10EddieGP) >>! In T192473#4171334, @Pchelolo wrote: > The Kafka queue works correctly overall as far as I can tell. kafka... 
[16:46:20] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services, and 2 others: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4171461 (10EddieGP) 05Open>03Resolved a:03EddieGP [16:58:34] 10Operations, 10Beta-Cluster-Infrastructure, 10DBA: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115#4171517 (10EddieGP) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T1700). [17:00:19] no parsoid deploy today [17:13:41] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10hardware-requests: Give misc dump crons their own host - https://phabricator.wikimedia.org/T181936#4171570 (10RobH) 05Open>03stalled a:05ArielGlenn>03None This has been approved for order via T190112. As such, I'm setting this to s... [17:27:59] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.129 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [17:28:36] (03CR) 10Smalyshev: [C: 031] wdqs: remove PrivateTmp option from wdqs-blazegraph systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/430049 (owner: 10Gehel) [17:29:06] 10Operations, 10CirrusSearch, 10Discovery, 10Search-Platform-Programs, 10Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4171600 (10Gehel) a:03Gehel [17:29:29] 10Operations, 10CirrusSearch, 10Discovery, 10Search-Platform-Programs, 10Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4160049 (10debt) @Gehel will write up some more information... 
[17:29:31] 10Operations, 10CirrusSearch, 10Discovery, 10Search-Platform-Programs, 10Discovery-Search (Current work): Jobs writing to the Elasticsearch cluster in codfw are timing out, causing all type of issues - https://phabricator.wikimedia.org/T193112#4171604 (10debt) @Gehel will write up some more information... [17:30:56] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 [17:31:31] 10Operations, 10Discovery-Search (Current work): search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#4171609 (10Smalyshev) [17:31:55] 10Operations, 10Discovery-Search (Current work): search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#4171610 (10debt) a:05debt>03None Removing myself as I unfortunately won't be able to help. Whomever is free next will take it. [17:32:56] PROBLEM - NTP peers on dns5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:46] RECOVERY - NTP peers on dns5001 is OK: NTP OK: Offset 0.000134 secs [17:36:32] !log otto@tin Started deploy [eventlogging/eventbus@c70e8c5]: remove occasional logging of request.body in prep for T193230 [17:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:37] T193230: EventBus HTTP Proxy service does not report errors to logstash - https://phabricator.wikimedia.org/T193230 [17:38:30] (03PS4) 10Ottomata: Log eventlogging-service-eventbus logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) [17:38:58] (03CR) 10jerkins-bot: [V: 04-1] Log eventlogging-service-eventbus logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [17:39:01] !log otto@tin Finished deploy [eventlogging/eventbus@c70e8c5]: remove occasional logging of request.body in prep for T193230 (duration: 02m 29s) [17:39:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:35] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61210 MB (12% inode=99%) [17:40:38] !log disabling eqsin<->codfw link for high packet loss on link [17:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:46] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11086/" [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [17:40:47] (03CR) 10Ottomata: [V: 032 C: 032] Log eventlogging-service-eventbus logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/429865 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [17:42:20] looking at disk space on elastic1025, that's definitely higher than it should be even with the recent changes [17:45:21] there is a shard leaving elastic1025 atm, so back to normal soon (probably...) [17:45:33] (03PS1) 10Ottomata: eventbus - Install python-logstash if using logstash logging [puppet] - 10https://gerrit.wikimedia.org/r/430092 (https://phabricator.wikimedia.org/T193230) [17:47:17] (03CR) 10Ottomata: [C: 032] eventbus - Install python-logstash if using logstash logging [puppet] - 10https://gerrit.wikimedia.org/r/430092 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [17:48:35] RECOVERY - Disk space on elastic1025 is OK: DISK OK [17:48:41] gehel: out of curiosity what are you using to watch shard activity? 
[17:49:56] herron: https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles-prometheus?panelId=64&fullscreen&orgId=1&from=now-7d&to=now [17:50:02] for the high level view [17:50:27] (03CR) 10Smalyshev: [C: 031] Recommendation API: Migrate to the new WDQS internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/430052 (https://phabricator.wikimedia.org/T190266) (owner: 10Mobrovac) [17:50:42] and `watch -d -n 60 'curl -s localhost:9200/_cat/recovery?h=index,time,type,stage,source_node,target_node,files_percent,bytes_percent,translog_ops_percent,bytes | grep -v done | sort -n -k 9'` running on one of the elastic nodes for the details [17:55:09] cool thx [17:55:54] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Services, and 2 others: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4171740 (10MarcoAurelio) > This link shows that the rename process was correctly finished. I unblocked it with a script. I've teste... [17:59:07] (03PS2) 10Bstorm: wiki replicas: provide backward compatibility for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T1800) [18:00:22] (03CR) 10Bstorm: wiki replicas: provide backward compatibility for MCR changes (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [18:02:19] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1034 - https://phabricator.wikimedia.org/T182556#4171754 (10RobH) a:05RobH>03Cmjohnson [18:02:51] jouncebot: next [18:02:51] In 0 hour(s) and 57 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T1900) [18:03:01] 10Operations, 10hardware-requests: eqiad/codfw: (4)+(4) hardware access request for videoscalers - 
https://phabricator.wikimedia.org/T188075#4171758 (10RobH) >>! In T188075#4048062, @brion wrote: > *nod* If there's general agreement not to add more specific hardware yet, we can just work with the reassigned im... [18:03:26] pff, next swat window is @01 AM my time, no way [18:07:40] Hauskatze: btw, regarding the wiki creation, i don't think there are "post wiki creation things" that an ops does [18:07:47] or at least i have never heard of them [18:08:19] usually we just merge DNS and Apache change [18:08:30] and then there is of course the DBA prep thing for sync to cloud [18:08:34] but that's all i know [18:08:41] "Notify the Operations list. In particular, it needs to be made clear whether the wiki should be public, or private. If public, ops will arrange for the wiki to be replicated to Cloud Services. If private, ops will need to add the wiki to $private_wikis in operations/puppet.git:/manifests/realm.pp." [18:08:52] https://wikitech.wikimedia.org/wiki/Add_a_wiki#Notify [18:09:07] just following instructions [18:09:31] but this is _before_ wiki creation [18:09:49] yeah, and idwikimedia has not been created yet [18:10:03] or are we talking about different wikis? [18:10:18] * Hauskatze is running short of coffee [18:10:32] i was talking about this: ". 
An op should request a window to create the wiki and merge its config and do post-install stuff " [18:10:50] that differs from any other wiki creation i have seen so far [18:10:50] ah [18:11:13] mutante: well, only an op can run the addWiki.php script afaics [18:11:27] i think it's "any deployer" [18:11:36] definitely not for a SWAT window [18:11:51] well, it should be possible for anyone with 'deployment' access to do the full process [18:11:55] (03PS3) 10Bstorm: wiki replicas: provide backward compatibility for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) [18:12:15] yep, not SWAT window and yep, any deployer [18:12:17] in theory anyone with 'restricted' can create the wiki as well, but they'll not be able to fully complete the process [18:12:59] I'm sorry if I don't distinguish between ops/deployers; for me they're all the same [18:13:00] restricted is weird and not as restricted as it sounds [18:13:05] people with access to the machines [18:13:21] restricted == deployment minus the ability to deploy [18:13:23] :P [18:14:04] (03CR) 10Anomie: "I noticed one more issue." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [18:17:33] (03CR) 10Bstorm: wiki replicas: provide backward compatibility for MCR changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [18:28:26] 10Operations, 10Traffic: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521#4171896 (10Bawolff) [18:34:37] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:37:32] (03CR) 10Deskana: [C: 031] Enable wgCiteResponsiveReferences on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430075 (https://phabricator.wikimedia.org/T193491) (owner: 10Framawiki) [18:37:55] (03PS1) 10Rush: openstack: add nova-fullstack to net_standby [puppet] - 10https://gerrit.wikimedia.org/r/430099 [18:38:37] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:39:58] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:42:53] (03PS2) 10Rush: openstack: add nova-fullstack to net_standby [puppet] - 10https://gerrit.wikimedia.org/r/430099 [18:44:54] (03PS1) 10ArielGlenn: don't die if file we want to report on is suddenly gone [dumps] - 10https://gerrit.wikimedia.org/r/430100 [18:44:58] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:46:02] (03CR) 10Rush: 
"http://puppet-compiler.wmflabs.org/11088/labnet1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/430099 (owner: 10Rush) [18:47:51] (03CR) 10Rush: [C: 032] openstack: add nova-fullstack to net_standby [puppet] - 10https://gerrit.wikimedia.org/r/430099 (owner: 10Rush) [18:50:38] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:14] (03CR) 10ArielGlenn: [C: 032] don't die if file we want to report on is suddenly gone [dumps] - 10https://gerrit.wikimedia.org/r/430100 (owner: 10ArielGlenn) [18:52:53] (03PS1) 10Ottomata: Need to set root logger to use logstash handler [puppet] - 10https://gerrit.wikimedia.org/r/430104 (https://phabricator.wikimedia.org/T193230) [18:53:28] !log ariel@tin Started deploy [dumps/dumps@5438d41]: keep running even if file we want to report on is moved/gone [18:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:32] !log ariel@tin Finished deploy [dumps/dumps@5438d41]: keep running even if file we want to report on is moved/gone (duration: 00m 04s) [18:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:48] (03PS2) 10Ottomata: Need to set root logger to use logstash handler [puppet] - 10https://gerrit.wikimedia.org/r/430104 (https://phabricator.wikimedia.org/T193230) [18:54:38] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[nova-fullstack] [18:54:47] (03CR) 10Ottomata: [C: 032] Need to set root logger to use logstash handler [puppet] - 10https://gerrit.wikimedia.org/r/430104 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [18:58:55] ^ me [18:59:06] (03PS1) 10Rush: openstack: switch nova-fullstack over to service [puppet] - 10https://gerrit.wikimedia.org/r/430106 [18:59:18] (03PS2) 10Rush: openstack: switch nova-fullstack over to service [puppet] - 10https://gerrit.wikimedia.org/r/430106 [18:59:39] (03CR) 10Rush: "labnet1001.eqiad.wmnet,labnet1002.eqiad.wnet" [puppet] - 10https://gerrit.wikimedia.org/r/430106 (owner: 10Rush) [18:59:41] (03CR) 10jerkins-bot: [V: 04-1] openstack: switch nova-fullstack over to service [puppet] - 10https://gerrit.wikimedia.org/r/430106 (owner: 10Rush) [19:00:04] no_justification: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T1900). 
[19:00:45] !log rolling restart of eventlogging-service-eventbus to apply logstash logging configs - T193230 [19:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:50] T193230: EventBus HTTP Proxy service does not report errors to logstash - https://phabricator.wikimedia.org/T193230 [19:01:35] (03CR) 10Rush: "labnet1001.eqiad.wmnet,labnet1002.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/430106 (owner: 10Rush) [19:04:35] 10Operations, 10Traffic: Consider adding expect-CT: header to enforce certificate transparency - https://phabricator.wikimedia.org/T193521#4172059 (10BBlack) a:03Vgutierrez [19:04:55] (03PS3) 10Rush: openstack: switch nova-fullstack over to service [puppet] - 10https://gerrit.wikimedia.org/r/430106 [19:05:41] (03PS4) 10Rush: openstack: switch nova-fullstack over to service [puppet] - 10https://gerrit.wikimedia.org/r/430106 [19:06:22] (03CR) 10Rush: [C: 032] openstack: switch nova-fullstack over to service [puppet] - 10https://gerrit.wikimedia.org/r/430106 (owner: 10Rush) [19:09:34] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:12:39] (03PS1) 10Rush: openstack: nova-network service stopped on net secondary [puppet] - 10https://gerrit.wikimedia.org/r/430109 [19:15:10] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2011 - https://phabricator.wikimedia.org/T187886#4172190 (10Papaul) [19:15:52] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.090 second response time [19:16:59] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2011 - https://phabricator.wikimedia.org/T187886#3989185 (10Papaul) [19:17:16] !log mw2174,mw2175,mw2176 ff - reinstalling with wmf-auto-reimage to stretch [19:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:34] 
(03PS1) 10Ottomata: Add logstash tag 'eventlogging-service-eventbus' [puppet] - 10https://gerrit.wikimedia.org/r/430111 (https://phabricator.wikimedia.org/T193230) [19:18:10] (03PS2) 10Ottomata: Add logstash tag 'eventlogging-service-eventbus' [puppet] - 10https://gerrit.wikimedia.org/r/430111 (https://phabricator.wikimedia.org/T193230) [19:20:03] (03PS2) 10Rush: openstack: nova-network service stopped on net secondary [puppet] - 10https://gerrit.wikimedia.org/r/430109 [19:20:52] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1978 bytes in 0.106 second response time [19:22:02] (03CR) 10Rush: "labnet1001.eqiad.wmnet,labnet1002.eqiad.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/430109 (owner: 10Rush) [19:22:20] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11092/kafka1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/430111 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [19:27:11] !log rolling restart of eventbus to apply logstash tag https://phabricator.wikimedia.org/T193230 [19:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:34] (03CR) 10Rush: [C: 032] openstack: nova-network service stopped on net secondary [puppet] - 10https://gerrit.wikimedia.org/r/430109 (owner: 10Rush) [19:32:45] (03PS3) 10Rush: openstack: nova-network service stopped on net secondary [puppet] - 10https://gerrit.wikimedia.org/r/430109 [19:32:56] (03CR) 10Rush: [C: 032] "http://puppet-compiler.wmflabs.org/11093/" [puppet] - 10https://gerrit.wikimedia.org/r/430109 (owner: 10Rush) [19:39:13] (03PS1) 10Rush: openstack: collapse net and net_standby into one role [puppet] - 10https://gerrit.wikimedia.org/r/430114 [19:42:15] (03CR) 10Rush: [C: 032] openstack: collapse net and net_standby into one role [puppet] - 10https://gerrit.wikimedia.org/r/430114 (owner: 10Rush) [19:48:40] (03PS1) 10Rush: DO NOT MERGE [puppet] - 
10https://gerrit.wikimedia.org/r/430118 [19:49:18] !log restbase: begin culling leaked revisions, others_T_mobile__ng_remaining - T192689 [19:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:22] T192689: Unchecked storage growth(?) - https://phabricator.wikimedia.org/T192689 [19:51:46] (03CR) 10Rush: "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Network_node_failure" [puppet] - 10https://gerrit.wikimedia.org/r/430118 (owner: 10Rush) [19:52:02] (03CR) 10Rush: "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting#Network_node_failure" [puppet] - 10https://gerrit.wikimedia.org/r/430118 (owner: 10Rush) [19:53:09] (03PS1) 10Ottomata: Use python-logstash formatter version 1 (not 0) [puppet] - 10https://gerrit.wikimedia.org/r/430120 (https://phabricator.wikimedia.org/T193230) [19:53:31] (03PS2) 10Ottomata: Use python-logstash formatter version 1 (not 0) [puppet] - 10https://gerrit.wikimedia.org/r/430120 (https://phabricator.wikimedia.org/T193230) [19:55:26] (03CR) 10Ottomata: [C: 032] Use python-logstash formatter version 1 (not 0) [puppet] - 10https://gerrit.wikimedia.org/r/430120 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [19:55:28] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11095/kafka1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/430120 (https://phabricator.wikimedia.org/T193230) (owner: 10Ottomata) [19:55:30] (03PS1) 10Urbanecm: Upload new logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430123 (https://phabricator.wikimedia.org/T193562) [19:56:03] (03PS1) 10Urbanecm: Use uploaded HD logos for yiwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) [19:59:18] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.30 [keeping static files] (duration: 01m 43s) [19:59:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:14] !log rolling restart of eventbus to apply new logstash formatter version T193230 [20:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:18] T193230: EventBus HTTP Proxy service does not report errors to logstash - https://phabricator.wikimedia.org/T193230 [20:06:19] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.27 (duration: 03m 08s) [20:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:12] (03PS4) 10Gilles: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) [20:14:50] (03PS5) 10Gilles: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) [20:16:13] (03CR) 10jerkins-bot: [V: 04-1] Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [20:20:19] (03PS6) 10Gilles: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) [20:28:42] (03PS1) 10Chad: group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430248 [20:28:44] (03CR) 10Chad: [C: 032] group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430248 (owner: 10Chad) [20:28:50] !log restbase: begin culling leaked revisions, commons_T_mobile__ng_lead - T192689 [20:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:54] T192689: Unchecked storage growth - https://phabricator.wikimedia.org/T192689 [20:30:13] (03Merged) 10jenkins-bot: group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430248 (owner: 10Chad) [20:35:09] !log update stale apifeatureusage-search-svc-eqiad-wmnet template 
in eqiad elasticsearch and delete unused apifeatureusage template [20:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:58] !log demon@tin Started scap: group0 to wmf.2 [20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:08] (03CR) 10jenkins-bot: group0 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430248 (owner: 10Chad) [20:41:19] (03PS1) 10EBernhardson: Convert text fields to string in apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/430250 (https://phabricator.wikimedia.org/T192614) [20:43:58] !log restbase: begin culling leaked revisions, commons_T_mobile__ng_remaining - T192689 [20:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:02] T192689: Unchecked storage growth - https://phabricator.wikimedia.org/T192689 [20:47:13] no_justification: Once the train is done and looks good, would like to roll out https://gerrit.wikimedia.org/r/#/c/430117/ to help with T191282 [20:47:13] T191282: Wikimedia\Rdbms\LoadBalancer::{closure}: found writes pending - https://phabricator.wikimedia.org/T191282 [20:49:29] (03PS2) 10EBernhardson: Convert string fields to text in apifeatureusage [puppet] - 10https://gerrit.wikimedia.org/r/430250 (https://phabricator.wikimedia.org/T192614) [21:02:06] PROBLEM - Check size of conntrack table on mw2174 is CRITICAL: Return code of 255 is out of bounds [21:02:42] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4172772 (10mmodell) @awight: `git lfs install` needs to be executed on each target and that isn't happening, currently. I can add a ho... 
[21:03:45] PROBLEM - Nginx local proxy to apache on mw2174 is CRITICAL: connect to address 10.192.32.62 and port 443: Connection refused [21:03:46] PROBLEM - Check systemd state on mw2174 is CRITICAL: Return code of 255 is out of bounds [21:05:26] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2174 is CRITICAL: Return code of 255 is out of bounds [21:23:32] AaronSchulz: staging on mwdebug1002 now [21:23:55] !log restbase: begin culling leaked revisions, enwiki_T_mobile__ng_{lead,remaining} - T192689 [21:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:00] T192689: Unchecked storage growth - https://phabricator.wikimedia.org/T192689 [21:26:47] no_justification: got a few errors running scap pull on mwdebug1002 during l10n stuff [21:26:53] e.g. [21:26:54] rsync: rename failed for "/srv/mediawiki/php-1.32.0-wmf.2/cache/l10n/upstream/l10n_cache-rgn.cdb.MD5" (from php-1.32.0-wmf.2/cache/l10n/upstream/.~tmp~/l10n_cache-rgn.cdb.MD5): No such file or directory (2) [21:27:04] Well, wmf.2 hasn't finished [21:27:09] You shouldn't be pulling [21:27:18] Aye, I misread the log. [21:27:22] :) [21:27:35] OK. I thought you finished about 20min ago, and weren't online anymore. [21:27:41] Anyways, my bad. I'll wait :) [21:28:02] Running a little behind today [21:28:58] no_justification: I'm holding off now, but FYI, I did pull down 430117 (wmf.1) to tin. [21:30:41] I can reset the state there if you want, or leave it as-is. 
[21:32:27] The deploy master is correct and wmf.1 didn't change, so you should be ok [21:32:54] Krinkle: also we can replace tin with deploy1001 now (this time for real) [21:33:10] i hear it's safe [21:35:29] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4172851 (10awight) @mmodell I'm reading some strange stuff here, https://github.com/git-lfs/git-lfs/wiki/Installation Apparently, `git... [21:38:17] (03PS1) 10Brian Wolff: Allow mediawiki-l and mediawiki-announce to be indexed [puppet] - 10https://gerrit.wikimedia.org/r/430260 (https://phabricator.wikimedia.org/T193572) [21:42:07] !log re-enabling eqsin-codfw link [21:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:16] !log demon@tin Finished scap: group0 to wmf.2 (duration: 68m 18s) [21:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:19] Krinkle: I'm done w/ train [21:47:22] I'm doing a limited ORES deployment, only to our canary box. 
[21:48:10] !log awight@tin Started deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627 [21:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:15] T180627: Support git-lfs - https://phabricator.wikimedia.org/T180627 [21:48:16] !log awight@tin Started deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627 [21:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:42] !log awight@tin Finished deploy [ores/deploy@4601497]: Test LFS deployment for ORES; T180627 (duration: 00m 26s) [21:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:52] (03Abandoned) 10Aaron Schulz: [DNM] Switch labs to using mcrouter instead of nutcracker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407464 (owner: 10Aaron Schulz) [21:49:39] (03PS2) 10Chad: Disable LQT on several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429463 [21:50:31] no_justification: Thanks [21:50:52] AaronSchulz: Staging on mwdebug1002, again :) [21:50:53] !log awight@tin Started deploy [ores/deploy@52347e0]: Test LFS deployment for ORES; T180627 [21:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:07] AaronSchulz: btw, it seems this message only happens on job runners, so nothing to verify, is that right? [21:51:24] I saw a few come from index.php on kowiki though [21:51:33] But not sure how to reproduce it [21:52:43] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4172903 (10mmodell) @awight: I don't think `git lfs install --local` will take care of the submodules. I suppose I could do `git submo... 
[21:52:56] I see the new msg with SqlSubscriptionManager::subscribe [21:54:15] !log awight@tin Finished deploy [ores/deploy@52347e0]: Test LFS deployment for ORES; T180627 (duration: 03m 21s) [21:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:19] T180627: Support git-lfs - https://phabricator.wikimedia.org/T180627 [21:55:04] AaronSchulz: Cool, I see it now. Nice [21:55:07] Rolling out [21:55:58] a few from api.php, but mostly RunSingleJob [21:57:29] !log krinkle@tin Synchronized php-1.32.0-wmf.1/includes/libs/rdbms/: Iba663c58224af9 (duration: 01m 19s) [21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:09] !log awight@tin Started deploy [ores/deploy@bf182e2]: Rollback ores1001 to master [22:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:22] !log awight@tin Finished deploy [ores/deploy@bf182e2]: Rollback ores1001 to master (duration: 01m 13s) [22:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:34] 10Operations, 10Puppet, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4172946 (10RobH) This may be something for @joe to review (since he is quite familiar with the compiler)? [22:04:57] !log krinkle@tin Synchronized php-1.32.0-wmf.1/extensions/EducationProgram/resources/: I7ca59823ffbf2 (duration: 01m 16s) [22:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:17] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4172949 (10awight) I'm happy with either solution, either a redundant `git lfs install` or the submodule foreach. It would be very su... 
[22:12:28] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4172971 (10Andrew) p:05Triage>03Normal [22:12:32] * Krinkle is done with deploys [22:19:16] !log rebooting labnet1002 for T193579 [22:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:20] T193579: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579 [22:24:59] (03CR) 10Krinkle: [C: 031] Collapse PHP_SAPI conditionals down into one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/393355 (owner: 10Reedy) [22:28:21] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4173045 (10Andrew) ``` root@labnet1002:~# uname -a Linux labnet1002 3.13.0-145-generic #194-Ubuntu SMP Thu Apr 5 15:20:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux ``` [22:44:46] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4173104 (10Andrew) [22:48:21] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4173109 (10Andrew) @Cmjohnson, I propose to fail over to labnet1002 on May 8th (Tuesday) and switch back to 1001 on May 15th (also a Tuesday). Can you commit to re-racki... [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180501T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:09:55] RECOVERY - Check size of conntrack table on mw2174 is OK: OK: nf_conntrack is 0 % full [23:26:53] 10Operations, 10Gerrit, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#4173222 (10mmodell) {D1039} [23:28:13] (03PS4) 10Bstorm: wiki replicas: provide backward compatibility for MCR changes [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) [23:29:04] (03CR) 10Bstorm: "To insert the database name, a code change was required either way. This was by far the simpler option. To go with the better join, I'd " [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [23:35:26] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2174 is OK: OK: synced at Tue 2018-05-01 23:35:24 UTC. [23:39:35] RECOVERY - Nginx local proxy to apache on mw2174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.059 second response time