[00:00:32] RECOVERY - Disk space on elastic1018 is OK: DISK OK [00:49:11] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [00:52:22] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:56:31] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [00:57:31] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [01:27:21] PROBLEM - High lag on wdqs1003 is CRITICAL: 3658 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:27:48] !log reindexing Serbian wikis on elastic@codfw (T196404) [02:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:51] T196404: Re-Re-Index Serbian Wikis after refactored plugins are deployed - https://phabricator.wikimedia.org/T196404 [02:33:07] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 13m 12s) [02:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:12] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1139 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [03:04:35] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.999) (duration: 14m 30s) [03:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:02] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [03:52:21] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[05:35:47] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968#4295963 (10Joe) p:05Triage>03Normal a:03Joe [05:39:52] 10Operations, 10Wikimedia-Mailing-lists: Official support for upgrade from existing Mailman 2.1 lists to Mailman 3 - https://phabricator.wikimedia.org/T130554#4295968 (10Joe) p:05Triage>03Low [05:41:53] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4295969 (10Joe) p:05Triage>03Normal a:03herron [06:03:21] 10Operations, 10ops-codfw, 10DC-Ops: Replace disk on wasat - https://phabricator.wikimedia.org/T197562#4295984 (10Joe) [06:03:34] 10Operations, 10ops-codfw, 10DC-Ops: Replace disk on wasat - https://phabricator.wikimedia.org/T197562#4295996 (10Joe) p:05Triage>03Normal [06:14:54] 10Operations, 10MediaWiki-Maintenance-scripts: cronspam cleanup: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375#4296014 (10Joe) [06:18:12] 10Operations: cronspam for slow queries in PageAssessments - https://phabricator.wikimedia.org/T197564#4296015 (10Joe) [06:18:14] 10Operations: cronspam for slow queries in PageAssessments - https://phabricator.wikimedia.org/T197564#4296026 (10Joe) p:05Triage>03Low [06:31:20] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4296028 (10Joe) a:03herron [07:48:42] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [07:52:01] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:26:33] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::hhvm: add auto_prepend_file everywhere [puppet] - 10https://gerrit.wikimedia.org/r/440822 (https://phabricator.wikimedia.org/T180183) [08:26:35] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::hhvm: enable TC garbage collection everywhere [puppet] - 10https://gerrit.wikimedia.org/r/440823 (https://phabricator.wikimedia.org/T103886) [08:27:20] 10Operations, 10Deployments, 10HHVM, 10Patch-For-Review, and 3 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#4296188 (10Joe) a:03Joe [08:27:41] 10Operations, 10Deployments, 10HHVM, 10Patch-For-Review, and 3 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1401646 (10Joe) will merge this change once we're out of the deployment freeze. [08:30:55] 10Operations, 10Operations-Software-Development, 10Pybal, 10Traffic, 10Patch-For-Review: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#4296208 (10Joe) @ema @Vgutierrez AIUI this bug is resolved since we've fixed the EtcdConfigObse... [08:31:03] 10Operations, 10Operations-Software-Development, 10Pybal, 10Traffic, 10Patch-For-Review: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#4296210 (10Joe) 05Open>03Resolved [08:32:21] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#3087392 (10Joe) @herron any news on this? I am assigning the ticket to you as you have an open patch for this. 
[08:32:33] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#4296216 (10Joe) a:03herron [08:33:24] 10Operations, 10Analytics: Broken /a/refinery-source/guard/run_all_guards.sh script on stat1002 - https://phabricator.wikimedia.org/T166937#4296220 (10Joe) [08:33:35] 10Operations, 10Analytics: Broken /a/refinery-source/guard/run_all_guards.sh script on stat1002 - https://phabricator.wikimedia.org/T166937#3312286 (10Joe) @elukey is this still ongoing? It's opened with priority high. [08:39:05] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4296230 (10awight) @daniel We're sort of in limbo now, implicitly blocking on potential TechCom discussion. Please let us know if there's a set date to discuss... [08:40:02] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4296232 (10Joe) p:05Triage>03Low [08:40:50] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4294420 (10Joe) @elukey since you did the work of removing the submodule, will you do the honours? 
[08:41:24] 10Operations, 10Release-Engineering-Team, 10Scap: find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470#4296237 (10Joe) p:05Triage>03High [08:44:41] 10Operations, 10ops-eqiad: Degraded RAID on labvirt1020 - https://phabricator.wikimedia.org/T194855#4296241 (10Joe) p:05Triage>03Normal [08:45:47] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Add CI namespace in staging k8s cluster - https://phabricator.wikimedia.org/T196654#4296242 (10Joe) p:05Triage>03Normal a:03Joe [09:09:14] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive operations/puppet/varnishkafka repository - https://phabricator.wikimedia.org/T197503#4296277 (10elukey) a:03elukey [09:15:58] (03CR) 10Krinkle: [C: 031] profile::mediawiki::hhvm: add auto_prepend_file everywhere [puppet] - 10https://gerrit.wikimedia.org/r/440822 (https://phabricator.wikimedia.org/T180183) (owner: 10Giuseppe Lavagetto) [09:16:50] <_joe_> Krinkle: can't deploy it this week, sadly [09:17:11] (03CR) 10ArielGlenn: [V: 032 C: 032] allow writeuptopageid to write multiple output files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/436511 (https://phabricator.wikimedia.org/T196063) (owner: 10ArielGlenn) [09:20:35] _joe_: which freeze btw? (Catching up on email still...) 
[09:24:14] <_joe_> Krinkle: SRE summit underway [09:24:24] <_joe_> it's basically me and apergos still online [09:24:31] yup [09:25:12] <_joe_> so a deployment freeze was requested [09:26:26] :D [09:28:07] (03PS1) 10MarcoAurelio: Increase password policies for 'steward' to max [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440834 (https://phabricator.wikimedia.org/T197577) [09:36:41] (03PS2) 10MarcoAurelio: Increase password policies for 'steward' to max [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440834 (https://phabricator.wikimedia.org/T197577) [09:47:26] (03PS1) 10Giuseppe Lavagetto: Add token for kubernetes CI [labs/private] - 10https://gerrit.wikimedia.org/r/440836 [09:47:51] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add token for kubernetes CI [labs/private] - 10https://gerrit.wikimedia.org/r/440836 (owner: 10Giuseppe Lavagetto) [09:48:21] (03PS1) 10Giuseppe Lavagetto: role::ci::master: add kubeconfig to access the ci namespace [puppet] - 10https://gerrit.wikimedia.org/r/440837 (https://phabricator.wikimedia.org/T196654) [09:48:34] Hi team, just a ping to let you know I'm deploying analytics hadoop cluster scripts [09:49:15] !log joal@deploy1001 Started deploy [analytics/refinery@e9dbe79]: Regular weekly deploy [09:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:21] <_joe_> joal: uhm we're supposedly in a deployment freeze :P [09:49:39] <_joe_> but I don't know if that applies to analytics [09:50:01] <_joe_> I guess andrew and luca are both at the SRE summit [09:52:28] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11538/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/440837 (https://phabricator.wikimedia.org/T196654) (owner: 10Giuseppe Lavagetto) [09:55:21] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:55:22] 
PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:56:08] !log joal@deploy1001 Finished deploy [analytics/refinery@e9dbe79]: Regular weekly deploy (duration: 06m 54s) [09:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:42] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [09:59:51] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:02:14] 10Operations, 10Traffic, 10User-Johan, 10User-notice: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4296396 (10Liuxinyu970226) [10:02:18] 10Operations, 10Traffic, 10User-notice: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202#4296397 (10Liuxinyu970226) [10:04:11] <_joe_> !log initialize namespace "ci" on the kubernetes staging cluster T196654 [10:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:13] T196654: Add CI namespace in staging k8s cluster - https://phabricator.wikimedia.org/T196654 [10:08:19] (03PS1) 10Giuseppe Lavagetto: ci::kubernetes_config: fix spurious space in filename [puppet] - 10https://gerrit.wikimedia.org/r/440838 [10:08:41] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] ci::kubernetes_config: fix spurious space in filename [puppet] - 10https://gerrit.wikimedia.org/r/440838 (owner: 10Giuseppe Lavagetto) [10:20:42] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Add CI namespace in staging k8s cluster - https://phabricator.wikimedia.org/T196654#4296449 (10Joe) I created a namespace called `ci` that you can deploy to using helm as long as you use the kubeconfig `/etc/kubernet... 
[10:20:48] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Add CI namespace in staging k8s cluster - https://phabricator.wikimedia.org/T196654#4296450 (10Joe) 05Open>03Resolved [10:22:13] (03PS1) 10ArielGlenn: writeuptopageid writes multiple output files [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/440839 [10:22:41] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61283 MB (12% inode=99%) [10:27:02] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 61855 MB (12% inode=99%) [10:27:30] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4296454 (10daniel) @awight In our last session, TechCom decided that we should keep an eye on this, but there is no action required at this point. JADE si self... [10:31:10] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242#4296460 (10Sebastian_Berlin-WMSE) [10:37:08] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4296461 (10awight) @daniel Thanks for the helpful notes. We'll probably come back to TechCom for more discussion in a few months, once we see how integrations... [10:41:12] RECOVERY - Disk space on elastic1025 is OK: DISK OK [10:43:51] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4296475 (10Joe) I would like this to wait for a review by the #dba and #traffic teams. Specifically: how badly would we be affected by a growth of the `page` t... 
[10:46:59] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#4296478 (10Joe) See also T196547 where the discussion should probably continue [10:47:07] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296479 (10Joe) p:05Triage>03Normal [10:48:20] 10Operations, 10Wikimedia-Incident: Add email queueing/failover to services currently using mail_smarthost[0] - https://phabricator.wikimedia.org/T196920#4296485 (10Joe) p:05Triage>03High [10:48:27] Hi _joe_ - I'm sorry I missed your ping earlier [10:48:53] 10Operations, 10Mail, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Phabricator outbound email seems to have a SPOF of mx1001 - https://phabricator.wikimedia.org/T196916#4296486 (10Joe) p:05Triage>03High [10:49:05] _joe_: Luca is indeed gone to SRE summit, but approved the deploy before leaving (i wouldn't dare deploy without their approval :) [10:49:07] 10Operations, 10ops-eqiad, 10DC-Ops: Replace memory bank on scb1002 - https://phabricator.wikimedia.org/T196901#4296487 (10Joe) p:05Triage>03Low [10:50:51] 10Operations, 10monitoring: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886#4296490 (10Joe) [10:50:58] 10Operations, 10monitoring: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886#4271553 (10Joe) p:05Triage>03Normal [10:51:25] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886#4271553 (10Joe) [10:52:15] <_joe_> !log removing wtp1043 from all pybal configuration until the disk is replaced T196886 [10:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:17] T196886: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 [10:53:35] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - 
https://phabricator.wikimedia.org/T196886#4296498 (10Joe) [10:55:57] 10Operations, 10LDAP-Access-Requests: Add MSantos to `ldap/wmf` - https://phabricator.wikimedia.org/T196943#4296499 (10Joe) a:05dr0ptp4kt>03Joe [10:56:31] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296506 (10awight) >>! In T183381#4296475, @Joe wrote: > I would at least think we should exclude bots from editing/creati... [10:58:53] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4260809 (10daniel) Since most concern pivot around the question of scalability, especially of the page table, in the case... [11:05:57] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296531 (10daniel) For the record, it would be possible to use MCR to store judgments about revisions. The idea would be t... [11:08:37] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4282750 (10Joe) @MSantos while we wait to understand the specific accesses you need, can you please read and sign the L3 document? So I can proceed to create your user and also to add... 
[11:12:41] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [11:13:58] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4296535 (10Joe) Specifically, it would be useful to use the permissions of another person in your team as a blueprint ("I need the same level of access as X" would help us specify bett... [11:14:51] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [11:17:14] (03PS1) 10Giuseppe Lavagetto: admin: add data for mbsantos in ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/440842 (https://phabricator.wikimedia.org/T196943) [11:18:31] (03CR) 10Giuseppe Lavagetto: [C: 032] admin: add data for mbsantos in ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/440842 (https://phabricator.wikimedia.org/T196943) (owner: 10Giuseppe Lavagetto) [11:22:42] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296552 (10awight) >>! In T196547#4296510, @daniel wrote: > Since most concern pivot around the question of scalability, e... [11:27:33] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296563 (10awight) >>! In T196547#4296531, @daniel wrote: > For the record, it would be possible to use MCR to store judgm... [11:29:40] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296564 (10daniel) Yes, that's what I'm suggesting. 
[11:35:16] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296590 (10awight) >>! In T196547#4296564, @daniel wrote: > Yes, that's what I'm suggesting: make the JADE edit a separate... [11:37:51] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296592 (10daniel) > I was concerned that recent change patrolling and AbuseFilter might not be well-integrated. We'll giv... [11:40:56] (03PS1) 10ArielGlenn: on dryrun, return the right number of results after (not) running command [dumps] - 10https://gerrit.wikimedia.org/r/440846 [11:59:35] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296623 (10awight) Some thoughts about MCR: * We want this new structured space to be available for both collaborative aud... [12:13:40] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4296657 (10daniel) > I'm not sure whether the article's talk page would be the right place for these discussions? I think... [12:31:39] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add MSantos to `ldap/wmf` - https://phabricator.wikimedia.org/T196943#4296686 (10Joe) Done. You should be able to access the corresponding resources. 
[12:31:46] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add MSantos to `ldap/wmf` - https://phabricator.wikimedia.org/T196943#4296687 (10Joe) 05Open>03Resolved [12:33:29] 10Operations: Update wikitech-static mediawiki version - https://phabricator.wikimedia.org/T197554#4296697 (10Joe) p:05Triage>03Low [12:39:50] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4296714 (10Joe) a:03Joe [12:42:29] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4296716 (10Joe) mwdebug2001 now has 8 free gigabytes, but one must wonder how we... [12:42:52] 10Operations, 10ops-eqiad, 10cloud-services-team: Labservices1001 crashed - https://phabricator.wikimedia.org/T196252#4296717 (10Joe) p:05Triage>03Normal [12:43:05] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: Rack/Setup frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T196417#4296718 (10Joe) p:05Triage>03Normal [12:44:27] 10Operations, 10Maps-Sprint, 10Maps (Tilerator): Externalize tile storage for maps - https://phabricator.wikimedia.org/T196474#4296720 (10Joe) p:05Triage>03Normal [12:55:32] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4296760 (10Reedy) ~4GB per MW version. 1.7GB l10n cdbs... 1.7GB of l10n json file... 
[13:07:49] 10Operations, 10Wikimedia-Logstash, 10Services (watching): Logstash started showing full serialized log entry as a message - https://phabricator.wikimedia.org/T197219#4282186 (10Joe) At a quick glance, neither Mediawiki-generated logs nor syslog generated ones show this issue. I can't find anything relevant... [13:10:25] (03CR) 10Reedy: Increase password policies for 'steward' to max (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440834 (https://phabricator.wikimedia.org/T197577) (owner: 10MarcoAurelio) [13:22:06] 10Operations, 10Wikimedia-Logstash, 10Services (watching): Logstash started showing full serialized log entry as a message - https://phabricator.wikimedia.org/T197219#4296812 (10Joe) The problem comes from https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/437864/ [13:22:16] 10Operations, 10Wikimedia-Logstash, 10Services (watching): Logstash started showing full serialized log entry as a message - https://phabricator.wikimedia.org/T197219#4296813 (10Joe) a:03Joe [13:27:45] (03PS1) 10Urbanecm: Enable TemplateStyles on ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440859 (https://phabricator.wikimedia.org/T197526) [13:30:09] (03PS1) 10WMDE-Fisch: Add ar, de and fa wikipedia to FileImporter interwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440860 (https://phabricator.wikimedia.org/T196976) [13:30:21] (03PS1) 10Giuseppe Lavagetto: role::logstash: fix gelf filtering [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) [13:31:02] <_joe_> mobrovac: ^^ [13:31:13] <_joe_> also whoever else is around [13:31:22] <_joe_> apergos, herron by any chance? 
[13:31:23] (03CR) 10jerkins-bot: [V: 04-1] Add ar, de and fa wikipedia to FileImporter interwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440860 (https://phabricator.wikimedia.org/T196976) (owner: 10WMDE-Fisch) [13:31:33] I'm here [13:31:58] <_joe_> apergos: can you take a look at https://gerrit.wikimedia.org/r/440861 ? logstash is broken for node services since Wednesday [13:32:15] hey [13:32:29] I'm looking. I don't know the GELF formats so it's going to take a bit [13:32:43] (03CR) 10Mobrovac: role::logstash: fix gelf filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) (owner: 10Giuseppe Lavagetto) [13:34:09] (03PS2) 10WMDE-Fisch: Add ar, de and fa wikipedia to FileImporter interwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440860 (https://phabricator.wikimedia.org/T196976) [13:34:19] (03CR) 10Giuseppe Lavagetto: role::logstash: fix gelf filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) (owner: 10Giuseppe Lavagetto) [13:34:28] hm _joe_, because we send full_message, this patch will continue showing the stringified version in the message field [13:34:40] ok, let me actually write it on the patch [13:35:41] (03CR) 10Mobrovac: "This still replaces message with full_message, which will result in a no-op for the problem at hand. 
In our case, full_message simply shou" [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) (owner: 10Giuseppe Lavagetto) [13:36:26] (03PS2) 10Giuseppe Lavagetto: role::logstash: fix gelf filtering [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) [13:36:49] <_joe_> mobrovac: oh I see [13:36:58] <_joe_> so I need to twist the logic a bit [13:36:58] (03PS1) 10Urbanecm: Create a few of namespace aliases for ruwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440863 (https://phabricator.wikimedia.org/T197565) [13:38:58] (03PS3) 10Giuseppe Lavagetto: role::logstash: fix gelf filtering [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) [13:40:14] <_joe_> mobrovac: better this way? :P [13:42:14] (03CR) 10Ppchelko: "We don't send `long_message` at all, see https://github.com/mhart/gelf-stream/blob/master/gelf-stream.js#L89" [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) (owner: 10Giuseppe Lavagetto) [13:42:30] _joe_: hm, same thing, we don't send long_message at all, we send full_message which is a stringified version of the complete log entry, cf. https://github.com/mhart/gelf-stream/blob/master/gelf-stream.js#L83-L90 [13:42:42] <_joe_> uh [13:42:46] _joe_: so basically we want this if-block to disappear altogether [13:42:48] :P [13:43:03] <_joe_> mobrovac: but it's needed for elasticsearch IIRC [13:43:11] so full-message gets set by some apps to serialized json, even though long-message is also set, is this right? next question: are there apps that [13:43:12] sigh [13:43:18] nm [13:43:31] question already out of date [13:43:41] <_joe_> yeah [13:44:12] <_joe_> mobrovac: you want short_message to be converted to message, right? 
[13:44:18] according to the spec, short_message should be replaced by message, not full_message [13:44:21] yes _joe_ [13:44:23] exactly [13:44:26] <_joe_> ok [13:44:34] <_joe_> that's easy to do [13:45:59] <_joe_> uhm not exactly the way we wanted to, but we can still do it. [13:54:40] <_joe_> Pchelolo, mobrovac can't we configure gelf-stream not to add full_message ? [13:55:19] * mobrovac was just waiting for _joe_ to propose to "fix" gelf-stream [13:55:24] _joe_: we can fork that library... [13:55:35] <_joe_> I asked if we can configure it [13:55:39] no [13:56:23] (03PS1) 10WMDE-Fisch: Enable license filters for the FileImporter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440864 (https://phabricator.wikimedia.org/T194502) [13:56:50] <_joe_> ok so next step is to make a specific exception for elasticsearch [13:58:47] 10Operations, 10JADE, 10TechCom, 10Patch-For-Review, and 2 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381#3851603 (10awight) a:03awight [14:03:59] (03PS4) 10Giuseppe Lavagetto: role::logstash: fix gelf filtering [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) [14:04:06] <_joe_> mobrovac, Pchelolo this version could work [14:04:34] (03PS1) 10Bmansurov: Increase Schema:CitationUsage sampling rate to 15% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440867 [14:07:09] <_joe_> well without further ado, lemme merge [14:07:29] <_joe_> once I remove the whitespaces, that is [14:07:32] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4296970 (10Papaul) p:05Triage>03Normal [14:08:27] (03PS5) 10Giuseppe Lavagetto: role::logstash: fix gelf filtering [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) [14:08:34] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Critical syslog messages [14:09:56] (03CR) 10Giuseppe Lavagetto: [C: 
032] role::logstash: fix gelf filtering [puppet] - 10https://gerrit.wikimedia.org/r/440861 (https://phabricator.wikimedia.org/T197219) (owner: 10Giuseppe Lavagetto) [14:11:49] (03CR) 10Bmansurov: [C: 04-1] "To be deployed on 6/25." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440867 (owner: 10Bmansurov) [14:20:46] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10Services (watching): Logstash started showing full serialized log entry as a message - https://phabricator.wikimedia.org/T197219#4297032 (10Joe) So after some reasoning: - elasticsearch needs to use `full_message` as short message is truncated - th... [14:20:49] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10Services (watching): Logstash started showing full serialized log entry as a message - https://phabricator.wikimedia.org/T197219#4297033 (10Joe) 05Open>03Resolved p:05Triage>03High [14:24:28] (03CR) 10Anomie: "It looks like I1a72aab4b2 already fixed it in a different (and probably more correct, since the MW_INSTALL_PATH environment variable still" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440743 (owner: 10Gergő Tisza) [14:30:07] 10Operations, 10Wikimedia-Mailing-lists: New closed communication public policy mailing list needed - https://phabricator.wikimedia.org/T196041#4297049 (10herron) a:03herron [14:31:45] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4297052 (10Papaul) a:05Papaul>03Marostegui Disk replaced [14:43:54] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Enable license filters for the FileImporter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440864 (https://phabricator.wikimedia.org/T194502) (owner: 10WMDE-Fisch) [14:45:33] ACKNOWLEDGEMENT - HP RAID on db2052 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: 
https://phabricator.wikimedia.org/T197606 [14:45:42] 10Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T197606#4297092 (10ops-monitoring-bot) [14:49:46] 10Operations, 10Mail, 10Wikimedia-Logstash: Ship MX logs to ELK - https://phabricator.wikimedia.org/T197173#4297095 (10Joe) p:05Triage>03Normal [14:50:35] 10Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T197606#4297097 (10Joe) p:05Triage>03Normal [14:50:58] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4297098 (10Marostegui) a:05Marostegui>03Papaul @Papaul disk failed, can we get another one? ``` physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Failed) ``` [14:51:26] 10Operations, 10ops-codfw: Degraded RAID on db2052 - https://phabricator.wikimedia.org/T197606#4297101 (10Marostegui) [14:51:28] 10Operations, 10Mail, 10monitoring, 10Wikimedia-Incident: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172#4297103 (10Joe) p:05Triage>03High [14:52:37] 10Operations, 10monitoring: Report problems found by mcelog - https://phabricator.wikimedia.org/T197086#4297107 (10Joe) p:05Triage>03Normal [14:53:21] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4297110 (10Papaul) a:05Papaul>03Marostegui done [14:53:34] 10Operations, 10monitoring: Report problems found in server's IPMI SEL - https://phabricator.wikimedia.org/T197084#4297112 (10Joe) p:05Triage>03Normal [14:54:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1019 IPMI alert - https://phabricator.wikimedia.org/T196751#4297113 (10Joe) p:05Triage>03Low [14:56:49] (03PS1) 10Jforrester: Update BetaFeature natural retirement dates based on last user-facing change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440878 [14:58:05] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy 
TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4297122 (10AfroThundr3007730) [14:59:16] I'm still seeing stuff in logstash with those long entries for event_str i.e. https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.06.18/cpjobqueue?id=AWQTZc7MoOODFPKvBUHd&_g=h@44136fa [14:59:33] am I misreading something? [15:00:55] mobrovac: that entry is still off, yes? [15:00:58] <_joe_> apergos: that is expected [15:01:15] <_joe_> the difference is you don't get the whole thing in "message" too [15:02:03] ok nm; the old event also has a nice looking 'message' so I zoomed right by it [15:03:31] looking now indeed [15:03:32] thnx _joe_ [15:03:41] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4297146 (10Deskana) [15:08:56] (03CR) 10Anomie: Move CLI overrides after InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) (owner: 10Anomie) [15:26:27] (03PS3) 10ArielGlenn: generate temp stubs for page ranges serially from same input stub file [dumps] - 10https://gerrit.wikimedia.org/r/436956 (https://phabricator.wikimedia.org/T196063) [15:42:15] (03PS3) 10Thiemo Kreuz (WMDE): Add ar, de and fa wikipedia to FileImporter interwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440860 (https://phabricator.wikimedia.org/T196976) (owner: 10WMDE-Fisch) [15:43:04] (03PS4) 10Thiemo Kreuz (WMDE): Add ar, de and fa wikipedia to FileImporter interwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440860 (https://phabricator.wikimedia.org/T196969) (owner: 10WMDE-Fisch) [15:43:09] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Add ar, de and fa wikipedia to FileImporter interwiki config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440860 (https://phabricator.wikimedia.org/T196969) (owner: 10WMDE-Fisch) 
[15:53:08] 10Operations, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547#4297337 (10awight) > Anyway, I didn't intend to derail this discussion.... is this the right place to discuss MCR as an al... [16:06:31] RECOVERY - Device not healthy -SMART- on db2052 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2052&var-datasource=codfw%2520prometheus%252Fops [16:09:17] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4297383 (10Papaul) I chat with @ayounsi, he confirmed that both servers were in the correct VLAN's. What i did on my end was to unplugged the other 3 NIC's form both server... [16:14:00] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4297394 (10Papaul) [16:18:30] (03CR) 10Mobrovac: "Bumping as the upgrade to Cassandra has been completed." [puppet] - 10https://gerrit.wikimedia.org/r/426152 (https://phabricator.wikimedia.org/T192112) (owner: 10Eevans) [16:19:56] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate redis-cluster or other techniques for making Redis not a single point of failure. - https://phabricator.wikimedia.org/T181559#4297417 (10awight) [16:20:00] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Celery manager implodes horribly if Redis goes down - https://phabricator.wikimedia.org/T181632#4297416 (10awight) [16:21:16] 10Operations, 10Scoring-platform-team, 10Wikimedia-Incident: Investigate redis-cluster or other techniques for making Redis not a single point of failure. 
- https://phabricator.wikimedia.org/T181559#3794157 (10awight) [16:28:43] 10Operations, 10ops-codfw: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4297490 (10Papaul) [16:39:37] !log DROP unused Cassandra keyspaces - T197080 [16:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:39] T197080: Clean up leftover key spaces - https://phabricator.wikimedia.org/T197080 [16:48:51] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [16:52:11] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:55:00] (03PS4) 10ArielGlenn: generate temp stubs for page ranges serially from same input stub file [dumps] - 10https://gerrit.wikimedia.org/r/436956 (https://phabricator.wikimedia.org/T196063) [17:13:21] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4297629 (10Papaul) [17:17:10] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4297634 (10Papaul) a:05Papaul>03BBlack @BBlack Lvs2009 and lvs2010 are ready. For switch port information please see T196946. Once they are up, we can decommission lvs2... [17:18:29] 'git clone' from gerrit is no longer working for me. Anyone else have this issue? I'm pretty sure I saw someone else complain about it but I forget what channel that was. [17:31:39] moritzm: morning! would you be able to help chelsyx login to the noc account? her login got reset by google's 30-day thing so she's stuck at the verification code stage and she needs to manage some google search console stuff. [17:32:42] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:35:59] 10Operations: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624#4297708 (10herron) p:05Triage>03Normal [17:37:39] moritzm: Never mind. I can login now without a verification code... [17:40:40] Niharika: hi, did you try an upper case or lower case? [17:40:52] For the username, if you're using git clone over ssh [17:40:53] paladox: Lowercase. [17:41:10] paladox: Oh, no username. [17:41:20] paladox: "git clone ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CharInsert" [17:41:31] From https://gerrit.wikimedia.org/g/mediawiki/extensions/CharInsert [17:41:39] Ah needs a username [17:41:40] Oh [17:41:49] You hit the same issue as me [17:41:50] Okay. [17:42:15] Niharika: https://phabricator.wikimedia.org/T183205 [17:43:40] paladox: Cool, thanks. [17:43:44] (03PS1) 10Jgreen: swap out samarium, swap in frdata1001, to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/440898 [17:44:56] (03PS2) 10Jgreen: swap out samarium, swap in frdata1001, to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/440898 [17:45:09] paladox: I think my issue was different. Gitiles UI gives an ssh:// link to clone, which doesn't work. But gerrit itself gives the https:// link which works. [17:45:17] https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/CharInsert [17:45:22] (03CR) 10Jgreen: [V: 032 C: 032] swap out samarium, swap in frdata1001, to nsca_frack.cfg.erb [puppet] - 10https://gerrit.wikimedia.org/r/440898 (owner: 10Jgreen) [17:45:47] Niharika: that was the issue I have :) [17:45:47] Former doesn't work because of absence of necessary keys, of course. [17:45:51] Ah, okay! 
[17:45:55] I tried fixing it upstream [17:46:05] But no easy way to get the username [17:46:06] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4297734 (10mobrovac) The patch above increases the rendering concurrency, which was too low for production purposes anyway. It should resolve... [17:46:10] It totally worked without the username though. [17:46:28] That would be for https:// [17:46:33] Ssh needs a username [17:46:41] Got it. :) [17:46:52] ssh doesn't need a username if it's in .ssh/config [17:49:30] (03PS1) 10Bstorm: Updating labvirt1019 mac [puppet] - 10https://gerrit.wikimedia.org/r/440899 (https://phabricator.wikimedia.org/T194964) [17:51:26] (03CR) 10Bstorm: [C: 032] Updating labvirt1019 mac [puppet] - 10https://gerrit.wikimedia.org/r/440899 (https://phabricator.wikimedia.org/T194964) (owner: 10Bstorm) [17:51:44] (03PS2) 10Bstorm: Updating labvirt1019 mac [puppet] - 10https://gerrit.wikimedia.org/r/440899 (https://phabricator.wikimedia.org/T194964) [18:07:34] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add MSantos to `ldap/wmf` - https://phabricator.wikimedia.org/T196943#4297785 (10MSantos) Thank you Joe! [18:11:30] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4297788 (10MSantos) Hello Joe, I am going to check that info with @dr0ptp4kt. Even though, I have read and signed the L3 document. 
[18:25:29] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, 10Multimedia: cronspam cleanup: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375#4297805 (10Krenair) [18:41:18] !log reindexing Serbian wikis on elastic@eqiad (T196404) [18:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:23] T196404: Re-Re-Index Serbian Wikis after refactored plugins are deployed - https://phabricator.wikimedia.org/T196404 [18:46:46] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4297874 (10Bstorm) Running re-install on labvirt1019 to cover changes. Then I'll rebuild the canary instance. [19:12:55] (03PS2) 10Hashar: ci: add some gated extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/440539 (https://phabricator.wikimedia.org/T197469) [19:15:07] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack frdata1001 - https://phabricator.wikimedia.org/T187364#4298024 (10Jgreen) 05Open>03Resolved [19:16:14] 10Operations, 10ops-eqiad: decommission samarium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T197630#4298026 (10Jgreen) [19:27:38] RECOVERY - Disk space on labvirt1019 is OK: DISK OK [19:27:51] RECOVERY - Check the NTP synchronisation status of timesyncd on labvirt1019 is OK: OK: synced at Mon 2018-06-18 19:27:44 UTC. 
[19:27:52] RECOVERY - dhclient process on labvirt1019 is OK: PROCS OK: 0 processes with command name dhclient [19:27:52] RECOVERY - DPKG on labvirt1019 is OK: All packages OK [19:28:11] RECOVERY - configured eth on labvirt1019 is OK: OK - interfaces up [19:28:41] RECOVERY - Check systemd state on labvirt1019 is OK: OK - running: The system is fully operational [19:29:26] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@37f6f32]: GUI update [19:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:22] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@37f6f32]: GUI update (duration: 00m 56s) [19:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:34] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@bcb2904]: GUI update [19:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:52] RECOVERY - kvm ssl cert on labvirt1019 is OK: Cert /etc/ssl/localcerts/labvirt-star.eqiad.wmnet.crt will not expire for at least 30 days. [19:49:01] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [19:49:26] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@bcb2904]: GUI update (duration: 18m 52s) [19:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:20] 10Operations, 10Mail, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Phabricator outbound email seems to have a SPOF of mx1001 - https://phabricator.wikimedia.org/T196916#4298063 (10herron) Network connectivity looks good from phab1001 to both MX servers. ``` phab1001:~# nc -vz mx1001.... [19:50:23] (03PS1) 10Herron: phabricator: set smtp-host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/440910 (https://phabricator.wikimedia.org/T196916) [19:52:21] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:27:54] ACKNOWLEDGEMENT - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Herron docker service on this host looks to have been throwing errors since at least may 31 with error initializing graphdriver: open /dev/mapper/docker-data: no such file or directory - The acknowledgement expires at: 2018-06-26 12:00:00. [20:30:33] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#4298154 (10Mholloway) [20:30:37] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Wikipedia-Android-App-Backlog: Determine URL paths for Zim files - https://phabricator.wikimedia.org/T172148#4298152 (10Mholloway) 05Open>03Invalid This is stalled, possibly indefinitely. Consider r... [20:31:34] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3444393 (10Mholloway) [20:32:10] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3444393 (10Mholloway) [20:32:14] 10Operations, 10Android-app-feature-Compilations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#4298169 (10Mholloway) 05Open>03Invalid This is stalled, possibly indefinitely. Conside... 
[20:41:53] (03PS1) 10Herron: gerrit: use localhost exim as smtp server [puppet] - 10https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) [20:43:28] (03CR) 10Paladox: "Does gerrit have exim working locally?" [puppet] - 10https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [20:57:55] (03CR) 10Herron: "Yes, here's a cursory test" [puppet] - 10https://gerrit.wikimedia.org/r/440970 (https://phabricator.wikimedia.org/T196920) (owner: 10Herron) [21:03:06] 10Operations, 10DNS, 10Traffic: Redirect http://status.wikipedia.org to http://status.wikimedia.org - https://phabricator.wikimedia.org/T32811#4298236 (10Framawiki) [21:03:09] 10Operations, 10DNS, 10Traffic: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#3321697 (10Framawiki) Was previously denied : {T32811}. [21:12:53] 10Operations, 10monitoring, 10Privacy, 10Security-Core: status.wikimedia.org should not load Google Analytics - https://phabricator.wikimedia.org/T115945#4298247 (10Framawiki) 05Invalid>03Open a:05Ottomata>03None Hello @Ottomata. Ping @Dzahn and @BBlack. The fact that this site is hosted by a thir... [21:23:40] 10Operations, 10TimedMediaHandler-Transcode: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333#4298261 (10Reedy) [21:25:43] 10Operations, 10TimedMediaHandler-Transcode: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333#4298265 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [21:35:15] 10Operations, 10TimedMediaHandler-Transcode: Backport libvpx 1.7.0, ffmpeg packages for VP9 -row-mt option - https://phabricator.wikimedia.org/T190333#4298287 (10brion) Awesome thanks! I _think_ it should be straightforward. 
:) [21:38:13] (03PS1) 10Ladsgroup: Fix ORES config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440974 (https://phabricator.wikimedia.org/T197633) [21:48:12] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [21:49:12] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [22:05:43] 10Operations, 10Epic: Encrypt all the things - https://phabricator.wikimedia.org/T111653#4298314 (10Jgreen) [23:08:27] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4298470 (10Bstorm) So, the good, we are on 10G Ethernet ``` [bstorm@labvirt1019]:~ $ sudo ethtool eth0 Settings for eth0: Supported port...