[01:18:05] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4298585 (10Mholloway)
[01:18:08] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4282750 (10Mholloway) Updated the task description with the specific groups needed. Thanks!
[01:37:55] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4298604 (10dr0ptp4kt)
[01:37:58] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4282750 (10dr0ptp4kt) Thank you, @Mholloway. I updated data/analytics groups as well.
[01:39:24] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4298606 (10dr0ptp4kt)
[02:19:23] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.8) (duration: 07m 39s)
[02:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:12] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received
[02:26:21] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[02:36:10] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.999) (duration: 06m 53s)
[02:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:43:12] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:22:17] 10Operations, 10Privacy, 10Security: status.wikimedia.org should have an alternative privacy policy - https://phabricator.wikimedia.org/T189763#4298630 (10Bawolff)
[03:28:11] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 665.75 seconds
[03:34:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 229.34 seconds
[03:36:01] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[03:37:11] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[03:37:11] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[03:38:21] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[04:00:21] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[04:01:22] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[04:03:41] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[04:04:42] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[04:09:11] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[04:10:21] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (bad seed) timed out before a response was received
[04:12:32] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[04:15:42] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[04:16:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:20:21] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[04:23:31] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[04:29:11] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[04:31:21] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[04:34:31] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[04:45:31] !log ban elastic1035 from cluster to allow it to recover
[04:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:57] 10Operations, 10ops-codfw: Disk predictive failure on db2052 - https://phabricator.wikimedia.org/T197146#4298664 (10Marostegui) a:05Marostegui>03Papaul This disk has also errors: ``` physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 600 GB, Predictive Failure) ```
[04:56:31] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[04:56:48] !log unban elastic1035
[04:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[05:31:15] 10Operations: Improve visibility of incoming operations tasks - https://phabricator.wikimedia.org/T197624#4298695 (10Aklapper) (Just for completeness: {icon search} in the upper right corner of project workboards would also allow filtering for most recent tasks via {nav name=Advanced Filter... > Created After}.)
[05:38:44] (03PS1) 10Ladsgroup: snapshot: make wikidata dump cronjobs use dump db servers [puppet] - 10https://gerrit.wikimedia.org/r/440986 (https://phabricator.wikimedia.org/T147169)
[05:40:21] (03CR) 10Ladsgroup: "This should not be merged until the patch in core gets deployed: Ic51204a6f6ce9db4cc96108e823e388512724eff" [puppet] - 10https://gerrit.wikimedia.org/r/440986 (https://phabricator.wikimedia.org/T147169) (owner: 10Ladsgroup)
[05:57:53] (03CR) 10ArielGlenn: [C: 031] "Very excited to see this happening! Hit me up for merge when ready." [puppet] - 10https://gerrit.wikimedia.org/r/440986 (https://phabricator.wikimedia.org/T147169) (owner: 10Ladsgroup)
[06:03:51] (03PS1) 10Urbanecm: Add namespace alias on pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440987 (https://phabricator.wikimedia.org/T197507)
[06:07:44] (03PS2) 10Urbanecm: Add namespace alias on pswikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440987 (https://phabricator.wikimedia.org/T197507)
[06:28:01] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:30:11] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[06:30:22] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:32:12] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats]
[06:48:11] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:50:21] yeah that was me checking to see if these were the usual puppet whines that go away on a rerun (they do)
[06:55:41] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:57:32] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:00:31] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:43] (03PS1) 10Marostegui: s2.hosts: Remove db1054 [software] - 10https://gerrit.wikimedia.org/r/440989
[07:27:49] 10Operations, 10Cloud-Services: 10G ports seem not to work on new HP hardware - https://phabricator.wikimedia.org/T197169#4298755 (10aborrero) >>! In T197169#4280925, @chasemp wrote: > ping @aborrero who indicated he had seem a similar issue in the past I had to use some non-free drivers in the past for HP se...
[07:34:41] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 62050 MB (12% inode=99%)
[07:37:20] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: bootstrap keystone [puppet] - 10https://gerrit.wikimedia.org/r/440109 (https://phabricator.wikimedia.org/T196633)
[07:39:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[07:41:12] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[07:41:52] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 24, down: 0, shutdown: 0
[07:53:54] ACKNOWLEDGEMENT - puppet last run on kubernetes2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Physical_volume[/dev/md2],Volume_group[docker] alexandros kosiaris Reimaging this more correctly
[08:01:17] 10Operations, 10DNS, 10Traffic: Redirect http://status.wikipedia.org to http://status.wikimedia.org - https://phabricator.wikimedia.org/T32811#4298820 (10hashar) >>! In T32811#347296, @hashar wrote: > RT mentioned in comment 1 is : > https://rt.wikimedia.org/Ticket/Display.html?id=1449 In Phabricator that...
[08:01:41] 10Operations, 10DNS, 10Traffic: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#4298826 (10hashar)
[08:01:44] 10Operations, 10DNS, 10Traffic: Redirect http://status.wikipedia.org to http://status.wikimedia.org - https://phabricator.wikimedia.org/T32811#4298824 (10hashar)
[08:03:56] 10Operations, 10DNS, 10Traffic: Redirect status.wikipedia.org to status.wikimedia.org - https://phabricator.wikimedia.org/T167239#3321697 (10hashar) Indeed. I have marked this task as a duplicate of T32811. The rationale is that it is tied to the wikipedia brand and certainly other projects (wikiquote, wiki...
[08:13:12] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:13:32] RECOVERY - Disk space on elastic1020 is OK: DISK OK
[08:18:21] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 5 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:18:53] <_joe_> I might have spoken too soon
[08:20:14] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4298887 (10Joe) So there isn't much I can do right now, the situation recovered;...
[08:20:37] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4298888 (10Joe) 05Open>03Resolved
[08:43:30] 10Operations, 10DBA, 10MediaWiki-Configuration: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4298930 (10Joe)
[08:43:47] 10Operations, 10DBA, 10MediaWiki-Configuration, 10User-Joe: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126#4279638 (10Joe) a:03Joe
[09:10:32] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received
[09:11:41] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[09:25:41] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0
[10:03:42] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[10:04:09] looking ^
[10:15:11] looks like a single node (elastic1036) is struggling, it seems to have happened to elastic1035 earlier last night, looking at the logs it seems someone banned it
[10:19:02] still no rejection, I'll wait a bit to see how it evolves
[10:25:59] <_joe_> dcausse: ack
[10:26:05] <_joe_> let me know if you need help
[10:26:12] _joe_: sure, thanks
[10:51:21] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[10:53:58] (03PS2) 10WMDE-Fisch: Enable license filters for the FileImporter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440864 (https://phabricator.wikimedia.org/T194502)
[10:54:28] (03PS1) 10Aklapper: Phab: Allow aklapper to purge user caches [puppet] - 10https://gerrit.wikimedia.org/r/441012
[10:55:05] (03CR) 10Aklapper: "Note: I have no idea how expensive running this command might be." [puppet] - 10https://gerrit.wikimedia.org/r/441012 (owner: 10Aklapper)
[11:02:27] (03PS1) 10WMDE-Fisch: Enable FileImporter on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441013 (https://phabricator.wikimedia.org/T196969)
[11:02:29] (03PS1) 10WMDE-Fisch: Enable FileExpoter on ar-, de- and fa-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441014 (https://phabricator.wikimedia.org/T196969)
[11:04:15] (03CR) 10jerkins-bot: [V: 04-1] Enable FileImporter on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441013 (https://phabricator.wikimedia.org/T196969) (owner: 10WMDE-Fisch)
[11:44:02] (03CR) 10WMDE-Fisch: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441013 (https://phabricator.wikimedia.org/T196969) (owner: 10WMDE-Fisch)
[11:49:51] PROBLEM - Disk space on ms-be1019 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error
[11:55:52] PROBLEM - MD RAID on ms-be1019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[11:55:52] ACKNOWLEDGEMENT - MD RAID on ms-be1019 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T197676
[11:55:56] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T197676#4299439 (10ops-monitoring-bot)
[11:56:21] PROBLEM - Disk space on ms-be1019 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdg1 is not accessible: Input/output error
[11:56:41] PROBLEM - Check systemd state on ms-be1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[12:38:00] <_joe_> ouch I cannot ssh into ms-be1019
[12:41:11] <_joe_> !log hard reboot of ms-be1019 - unable to ssh, console showing i/o errors only
[12:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:21] PROBLEM - Host ms-be1019 is DOWN: PING CRITICAL - Packet loss = 100%
[12:47:17] (03PS1) 10Paladox: Gerrit: Set log level to debug for AccountManager [puppet] - 10https://gerrit.wikimedia.org/r/441032
[12:49:41] RECOVERY - Disk space on ms-be1019 is OK: DISK OK
[12:49:51] RECOVERY - Host ms-be1019 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms
[12:49:59] (03PS2) 10Paladox: Gerrit: Set log level to debug for AccountManager [puppet] - 10https://gerrit.wikimedia.org/r/441032 (https://phabricator.wikimedia.org/T197083)
[12:50:01] RECOVERY - Check systemd state on ms-be1019 is OK: OK - running: The system is fully operational
[12:50:31] RECOVERY - MD RAID on ms-be1019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[12:50:48] well you could do that through set log level so no point in me doing it in puppet.
[12:50:49] https://gerrit-review.googlesource.com/Documentation/cmd-logging-set-level.html
[12:51:13] (03Abandoned) 10Paladox: Gerrit: Set log level to debug for AccountManager [puppet] - 10https://gerrit.wikimedia.org/r/441032 (https://phabricator.wikimedia.org/T197083) (owner: 10Paladox)
[12:53:51] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T197676#4299439 (10Joe) This seemed to be an issue with the smartarray controller; a simple hard reboot fixed the issue.
[12:54:01] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T197676#4299582 (10Joe) 05Open>03Resolved p:05Triage>03Low
[13:05:36] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4299619 (10thcipriani) >>! In T197275#4296760, @Reedy wrote: > Although we need t...
[13:07:48] 10Operations, 10Release-Engineering-Team: Scap error from mwdebug2001.codfw.wmnet: sync: write failed on "/srv/mediawiki/wmf-config/InitialiseSettings.php": No space left on device (28) - https://phabricator.wikimedia.org/T197275#4299637 (10Reedy) Ok, that makes sense. And further answers Joes question/query t...
[13:31:59] (03CR) 10DCausse: [C: 031] Add Lexemes to instant-index set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440420 (https://phabricator.wikimedia.org/T196896) (owner: 10Smalyshev)
[13:44:34] 10Operations, 10Wikimedia-Mailing-lists: New closed communication public policy mailing list needed - https://phabricator.wikimedia.org/T196041#4299699 (10herron) Hello, I'd be happy to create a new list for you. One question -- In terms of naming, since this is intended to be the private channel for the exis...
[14:03:45] (03PS1) 10Herron: puppetdb: double nginx client_max_body_size [puppet] - 10https://gerrit.wikimedia.org/r/441044
[14:06:19] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetdb: double nginx client_max_body_size [puppet] - 10https://gerrit.wikimedia.org/r/441044 (owner: 10Herron)
[14:06:38] !log temporarily disabling puppet agents for deploy of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441044/
[14:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:58] (03CR) 10Herron: [C: 032] puppetdb: double nginx client_max_body_size [puppet] - 10https://gerrit.wikimedia.org/r/441044 (owner: 10Herron)
[14:11:45] !log increased client_max_body_size on puppetdb nginx frontends from 30m to 60m https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441044/
[14:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:30] !log re-enabling puppet agents
[14:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:15] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[14:41:16] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Enable FileImporter on Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441013 (https://phabricator.wikimedia.org/T196969) (owner: 10WMDE-Fisch)
[14:41:36] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Enable FileExpoter on ar-, de- and fa-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441014 (https://phabricator.wikimedia.org/T196969) (owner: 10WMDE-Fisch)
[14:43:56] (03CR) 10Krinkle: [C: 031] Move CLI overrides after InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440543 (https://phabricator.wikimedia.org/T197475) (owner: 10Anomie)
[14:52:28] (03PS2) 10WMDE-Fisch: Enable FileExpoter on ar-, de- and fa-wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441014 (https://phabricator.wikimedia.org/T196969)
[15:01:15] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[15:01:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0
[15:08:41] (03PS1) 10Papaul: DNS: Add production DNS entries for dns200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/441055 (https://phabricator.wikimedia.org/T196493)
[15:10:09] 10Operations, 10ops-codfw, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns200[12].wikimedia.org - https://phabricator.wikimedia.org/T196493#4299957 (10Papaul)
[15:11:15] (03PS16) 10DCausse: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717)
[15:11:17] (03PS1) 10DCausse: Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717)
[15:11:19] (03PS1) 10DCausse: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717)
[15:12:31] (03CR) 10jerkins-bot: [V: 04-1] Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse)
[15:12:44] (03CR) 10jerkins-bot: [V: 04-1] Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse)
[15:12:46] (03CR) 10jerkins-bot: [V: 04-1] Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse)
[15:13:35] 10Operations, 10ops-codfw, 10DNS, 10Traffic, 10netops: switch port configuration for dns200[1-2] - https://phabricator.wikimedia.org/T197697#4299965 (10Papaul) p:05Triage>03Normal
[15:17:51] (03CR) 10jerkins-bot: [V: 04-1] Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse)
[15:17:53] (03CR) 10jerkins-bot: [V: 04-1] Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse)
[15:18:03] (03CR) 10jerkins-bot: [V: 04-1] Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse)
[15:20:20] (03PS18) 10DCausse: Add cirrussearch settings for wikibase (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717)
[15:20:22] (03PS3) 10DCausse: Add cirrussearch settings for wikibase (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441056 (https://phabricator.wikimedia.org/T182717)
[15:20:24] (03PS3) 10DCausse: Add cirrussearch settings for wikibase (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441057 (https://phabricator.wikimedia.org/T182717)
[15:26:44] (03PS1) 10Papaul: DHCP: Add MAC address entries for dns200[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/441059 (https://phabricator.wikimedia.org/T196493)
[15:29:26] 10Operations, 10ops-codfw: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4300027 (10Papaul)
[15:33:17] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4300047 (10herron)
[15:39:02] 10Operations, 10Analytics, 10SRE-Access-Requests: Requesting access for mbsantos - https://phabricator.wikimedia.org/T197237#4300055 (10herron) Thanks for the group details! This access request will now be reviewed for approval at the next weekly SRE/Ops meeting. The next meeting occurs on Monday, Jun 25....
[15:41:40] 10Operations, 10Wikimedia-Mailing-lists: New closed communication public policy mailing list needed - https://phabricator.wikimedia.org/T196041#4300058 (10Dimi_z) Yes, you can do that. I would also appreciate publicpolicy-cc (as in closed communications) but both are fine by me. Thanks a ton! Le mar. 19 ju...
[15:53:42] 10Operations, 10ops-codfw: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4300073 (10Papaul)
[15:56:07] 10Operations, 10ops-codfw: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4300081 (10Papaul) Switch port information : both servers are racked in D8 wmf6652 ge-8/0/3 wmf6653 ge-8/0/4
[15:57:34] 10Operations, 10ops-codfw: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4300083 (10Papaul)
[16:00:55] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Connect
[16:02:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 66 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[16:05:25] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 24, down: 0, shutdown: 0
[16:06:45] (03PS1) 10Papaul: DNS: Fix mgmt asset tag for dns2002 [dns] - 10https://gerrit.wikimedia.org/r/441062 (https://phabricator.wikimedia.org/T196493)
[16:07:36] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[16:09:40] !log mobrovac@deploy1001 Started deploy [proton/deploy@43af7d9]: (no justification provided)
[16:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:07] !log mobrovac@deploy1001 Finished deploy [proton/deploy@43af7d9]: (no justification provided) (duration: 00m 27s)
[16:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:25] <_joe_> wow already done?
[16:11:23] yup
[16:14:47] (03PS1) 10Papaul: DNS: Add DNS asset tag mgmt for spare servers [dns] - 10https://gerrit.wikimedia.org/r/441063 (https://phabricator.wikimedia.org/T196493)
[16:17:57] (03PS2) 10Papaul: DNS: Add DNS asset tag mgmt for spare servers [dns] - 10https://gerrit.wikimedia.org/r/441063 (https://phabricator.wikimedia.org/T196666)
[16:18:50] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4300165 (10Papaul)
[16:19:46] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666#4265061 (10Papaul) a:05Papaul>03RobH @RobH done at my end
[16:20:46] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active
[16:21:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 22, down: 0, shutdown: 2
[16:25:06] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[16:26:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:30:12] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 326 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[16:31:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 6 probes of 305 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:31:34] 10Operations, 10Analytics, 10User-Elukey: dbstore1002 disk 5 not healthy - https://phabricator.wikimedia.org/T193738#4300231 (10Marostegui) The disk finally failed
[16:33:44] ACKNOWLEDGEMENT - MegaRAID on db1053 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T197706
[16:33:48] 10Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T197706#4300233 (10ops-monitoring-bot)
[16:41:43] PROBLEM - MegaRAID on dbstore1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[16:41:54] ACKNOWLEDGEMENT - MegaRAID on dbstore1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T197707
[16:41:58] 10Operations, 10ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707#4300251 (10ops-monitoring-bot)
[16:49:49] ACKNOWLEDGEMENT - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: Herron Telia Carrier Reference: 00863727 We are seeing an outage in the New York area affecting your circuit. This is a suspected cable fault
[16:49:49] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: Herron Telia Carrier Reference: 00863727 We are seeing an outage in the New York area affecting your circuit. This is a suspected cable fault
[16:53:23] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707#4300279 (10Marostegui)
[17:01:59] 10Operations, 10Analytics, 10User-Elukey: dbstore1002 disk 5 not healthy - https://phabricator.wikimedia.org/T193738#4300297 (10Marostegui)
[17:02:02] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T197707#4300300 (10Marostegui)
[17:11:58] (03PS1) 10Jayprakash12345: Enable Special:Import option in ta.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441069
[17:12:50] RECOVERY - Device not healthy -SMART- on dbstore1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbstore1002&var-datasource=eqiad%2520prometheus%252Fops
[17:21:58] (03PS1) 10Urbanecm: Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441071
[17:24:26] (03Abandoned) 10Urbanecm: Test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441071 (owner: 10Urbanecm)
[17:31:38] (03PS2) 10Jayprakash12345: Enable Special:Import option in ta.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441069 (https://phabricator.wikimedia.org/T196445)
[17:39:38] (03CR) 10Urbanecm: [C: 031] "Maybe say "Add import sources to ta.wiktionary" instead of "Enable Special:Import" => it is already enabled, it just must have sources :)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441069 (https://phabricator.wikimedia.org/T196445) (owner: 10Jayprakash12345)
[17:44:40] (03PS3) 10Jayprakash12345: Add import sources to ta.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441069 (https://phabricator.wikimedia.org/T196445)
[17:44:58] (03CR) 10Urbanecm: [C: 031] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441069 (https://phabricator.wikimedia.org/T196445) (owner: 10Jayprakash12345)
[17:45:06] (03CR) 10Jayprakash12345: "@Urbanecm Thanks for review :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441069 (https://phabricator.wikimedia.org/T196445) (owner: 10Jayprakash12345)
[17:59:30] PROBLEM - configured eth on db1053 is CRITICAL: vboxnet0 reporting no carrier.
[18:02:51] 10Operations, 10Puppet, 10Patch-For-Review: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444#4300471 (10Aklapper)
[18:03:32] (03PS1) 10Revi: Add wikimania2018wiki to commonsupload.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441076 (https://phabricator.wikimedia.org/T197714)
[18:05:46] 10Puppet, 10Beta-Cluster-Infrastructure, 10ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Puppet broken on deployment-ores01 due to missing hieradata - https://phabricator.wikimedia.org/T184478#4300508 (10Aklapper)
[18:05:57] 10Operations, 10Operations-Software-Development, 10Continuous-Integration-Config: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#4300510 (10Aklapper)
[18:06:11] 10Operations, 10Puppet, 10Goal: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4300512 (10Aklapper)
[18:06:17] 10Operations, 10Puppet: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564#4300513 (10Aklapper)
[18:14:52] 10Operations, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T197717#4300527 (10Kaioduarte-TB)
[18:23:07] 10Operations, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T197717#4300563 (10Aklapper) 05Open>03Invalid @Kaioduarte-TB: Please use T196552 instead.
[18:23:25] 10Operations, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T197717#4300568 (10Aklapper) a:05Kaioduarte-TB>03None
[18:24:48] 10Operations, 10Wikimedia-Mailing-lists: New closed communication public policy mailing list needed - https://phabricator.wikimedia.org/T196041#4300586 (10herron) Great! `publicpolicy-private` has been created with `dimi@wikimedia.be` and `anna@wikimedia.be` as owners. The initial list password details shoul...
[18:24:59] 10Operations, 10Wikimedia-Mailing-lists: New closed communication public policy mailing list needed - https://phabricator.wikimedia.org/T196041#4300588 (10herron) p:05Triage>03Normal
[18:46:07] 10Operations, 10Community-Liaisons, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T197717#4300635 (10Kaioduarte-TB) esta tarefa foi criada pelo o motivo basico de reunirmos usuarios brasileiros dos projetos: //[[ https://pt.wikipedia.org | wikipedia ]]// //[[ https://wikime...
[18:46:46] 10Operations, 10Community-Liaisons, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T197717#4300638 (10Kaioduarte-TB) p:05Triage>03Unbreak!
[18:49:19] 10Operations, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T196552#4300647 (10Kaioduarte-TB) 05stalled>03Open p:05Lowest>03Unbreak! a:03Kaioduarte-TB
[18:55:04] 10Operations, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T196552#4300663 (10Aklapper) 05Open>03Invalid p:05Unbreak!>03Triage Well, not like that... @Kaioduarte-TB: It does not look like you are interested in a constructive dialog so I am closing this as invalid.
[18:58:17] 10Operations, 10Community-Liaisons, 10Wikimedia-Mailing-lists: comunicação_BR1 - https://phabricator.wikimedia.org/T197717#4300684 (10Aklapper) p:05Unbreak!>03Triage @Kaioduarte-TB: See T196552 instead. Your actions on Phabricator (renaming tasks to gibberish, adding/removing random project tags from tas...
[19:56:01] 10Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T197706#4300765 (10Marostegui) 05Open>03Resolved a:03Marostegui This server is going to be decommissioned T194634
[20:08:34] (03PS1) 10Hagar Shilo: CORS whitelist chapter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441096 (https://phabricator.wikimedia.org/T181165)
[20:11:14] (03CR) 10jerkins-bot: [V: 04-1] CORS whitelist chapter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441096 (https://phabricator.wikimedia.org/T181165) (owner: 10Hagar Shilo)
[20:15:40] (03PS2) 10Hagar Shilo: CORS whitelist chapter wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441096 (https://phabricator.wikimedia.org/T181165)
[20:22:43] (03CR) 10Reedy: [C: 04-1] "These should only be Wikimedia hosted chapter wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441096 (https://phabricator.wikimedia.org/T181165) (owner: 10Hagar Shilo)
[20:48:01] (03PS1) 10Herron: hue: change smtp_host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441130 (https://phabricator.wikimedia.org/T196920)
[20:48:03] (03PS1) 10Herron: iegreview: change smtp_host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441131 (https://phabricator.wikimedia.org/T196920)
[20:48:05] (03PS1) 10Herron: oozie: change smtp_host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441132 (https://phabricator.wikimedia.org/T196920)
[20:48:07] (03PS1) 10Herron: wikimania_scholarships: change smtp_host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441133 (https://phabricator.wikimedia.org/T196920)
[20:48:09] (03PS1) 10Herron: sentry: change EMAIL_HOST to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441134 (https://phabricator.wikimedia.org/T196920)
[20:48:13] (03PS1) 10Herron: wikidump: change smtpserver to localhost [puppet] - 10https://gerrit.wikimedia.org/r/441135 (https://phabricator.wikimedia.org/T196920)
[20:48:48] Reedy, blast, ninja'd
[20:49:23] 10Operations, 10ops-codfw, 10DNS, 10Traffic, 10netops: switch port configuration for dns200[1-2] - https://phabricator.wikimedia.org/T197697#4300899 (10Peachey88)
[20:49:26] heh
[20:51:36] 10Operations, 10Patch-For-Review, 10User-herron, 10Wikimedia-Incident: Add email queueing/failover to services currently using mail_smarthost[0] - https://phabricator.wikimedia.org/T196920#4300903 (10herron)
[20:52:14] 10Operations, 10Mail, 10Phabricator, 10Release-Engineering-Team, and 3 others: Phabricator outbound email seems to have a SPOF of mx1001 - https://phabricator.wikimedia.org/T196916#4300906 (10herron)
[22:42:18] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186#4301029 (10bd808) >>! In T194186#4239504, @Cmjohnson wrote: > @chasemp please let me know network requirements. We would like 10G if poss...
[23:31:00] PROBLEM - Host mw1272 is DOWN: PING CRITICAL - Packet loss = 100%
[23:52:15] 10Operations, 10Traffic, 10User-Johan, 10User-notice: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4253860 (10TTO) At https://en.wikipedia.org/sec-warning why does the English text not refer to the section at the bottom with techni...
[23:52:44] 10Operations, 10Wikimedia-Mailing-lists: New mail list for Signpost team - https://phabricator.wikimedia.org/T197732#4301121 (10Brianhe)
[23:53:07] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4301132 (10pmiazga) @akosiaris > answers to your questions, also a very important note about handling 503 errors. >! In T186748#4294644, @ako...
[23:56:37] 10Operations, 10Wikimedia-Mailing-lists: New mail list for Signpost team - https://phabricator.wikimedia.org/T197732#4301133 (10Brianhe)
[23:56:57] 10Operations, 10Wikimedia-Mailing-lists: New mail list for Signpost team - https://phabricator.wikimedia.org/T197732#4301121 (10Brianhe)