[00:03:25] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[00:14:00] (Abandoned) Krinkle: Remove wgTemplateStylesAllowedUrls override (matches default) [mediawiki-config] - https://gerrit.wikimedia.org/r/486829 (owner: Krinkle)
[00:29:29] 3/6
[00:31:09] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:38:16] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[00:38:44] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 83.32 ms
[00:39:37] hmm
[00:39:39] uh
[00:39:51] any ops around ^^?
[00:41:35] paladox: no ping loss towards esams from home :P
[00:41:45] me too
[04:05:29] PROBLEM - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0
[04:05:30] ACKNOWLEDGEMENT - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214813
[04:05:35] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (ops-monitoring-bot)
[04:37:18] ACKNOWLEDGEMENT - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214814
[04:37:23] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214814 (ops-monitoring-bot)
[05:09:16] ACKNOWLEDGEMENT - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 4, Working: 4, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214815
[05:09:21] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214815 (ops-monitoring-bot)
[05:32:17] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Peachey88)
[05:32:19] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214814 (Peachey88)
[05:44:44] Operations, serviceops, User-ArielGlenn, User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (Az1568)
[05:58:57] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 227.85 seconds
[06:29:41] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:45:21] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27532 MB (5% inode=99%)
[06:55:29] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[06:55:57] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:51:37] Operations, IDS-extension, Wikimedia Taiwan, Wikimedia-Extension-setup, and 2 others: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693 (Shizhao)
[08:45:37] Operations: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (Daniel_Mietchen)
[08:50:33] !next
[08:50:46] jouncebot: next
[08:50:46] In 169 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1030)
[08:54:20] (PS1) Rxy: Enable CheckUser at Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820)
[09:02:10] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1028.37 seconds
[09:20:29] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: fix script naming for run-parts [puppet] - https://gerrit.wikimedia.org/r/486822 (https://phabricator.wikimedia.org/T87001) (owner: BryanDavis)
[09:31:27] rxy: it's all hands this week iirc
[09:31:54] rxy: yeap, no deploys https://wikitech.wikimedia.org/wiki/Deployments#Week_of_January_28th
[09:32:57] Operations, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (Peachey88)
[10:10:46] Operations, Maps (Kartotherian): Create discovery entry for Kartotherian - https://phabricator.wikimedia.org/T214672 (Mathew.onipe) From investigation, I think there's a dns discovery entry for kartotherian. here: https://github.com/wikimedia/operations-dns/blob/master/templates/wmnet#L5189, https://gi...
[10:10:58] Operations, Maps (Kartotherian): Create discovery entry for Kartotherian - https://phabricator.wikimedia.org/T214672 (Mathew.onipe) p:Triage→Normal
[10:35:16] Operations, monitoring: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines - https://phabricator.wikimedia.org/T214762 (jcrespo) @faidon A bit offtopic but related- I filed this because I accepted this patch: https://gerrit.wikimedia.org/r/479741 Should I revert it?
[10:45:41] !log stop, upgrade and reboot db2055
[10:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:57] (Abandoned) Arturo Borrero Gonzalez: decom: delete labtestnet2001.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/451050 (https://phabricator.wikimedia.org/T201440) (owner: Arturo Borrero Gonzalez)
[11:03:22] (Abandoned) Arturo Borrero Gonzalez: decom: delete labtestnet2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/451051 (https://phabricator.wikimedia.org/T201440) (owner: Arturo Borrero Gonzalez)
[11:12:30] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo)
[11:13:46] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo)
[11:15:21] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo) I think we can delete `/srv/backups/`? As those are duplicated from those on es2001 / bacula. Agree?
[11:21:15] !log stop, upgrade and reboot db2062
[11:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:37] Operations, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (jcrespo) @Jrogers-WMF My understanding would be that Grafana dashboards, (for example https://grafana.wikimedia.org/d/000000278/mysql-aggregated ) are factual data (metrics, numbers) a...
[11:33:58] Operations, WMF-Legal, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (jcrespo)
[12:31:58] PROBLEM - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100%
[12:37:28] godog: ^
[12:38:12] he won't be around for a while
[12:39:35] It's strange that it just went down
[12:39:45] [8471111.439577] sd 0:1:0:1: rejecting I/O to offline device
[12:39:49] on logs
[12:40:00] jynus: is there anybody we can call for this
[12:40:08] you can call me :-)
[12:40:14] :)
[12:40:17] ok
[12:40:37] swift is supposed to be reliable against a single-machine failure
[12:40:40] so not too worried
[12:40:47] I guess so too
[12:41:02] can we even powercycle it?
[12:41:07] I think 1034 had hw issues in the past
[12:41:07] maybe via mgmt?
[12:41:11] yes we can
[12:41:13] I am on it
[12:41:17] but I am trying to avoid it
[12:41:18] cool
[12:42:03] godog had to powercycle ms-be1020 yesterday as it caused ms-be1034's load to keep going up
[12:42:49] !log restarting all elasticsearch instances on relforge1002 to test spicerack command
[12:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:39] I guess there is no other way- but I am not even sure if it will come back later
[12:45:55] !log powercycle ms-be1034
[12:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:44] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.21, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f7519796510: Failed to establish a new connection: [Errno 111] Connec
[12:47:06] ^that's me. It's expected
[12:47:38] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Read timed out. (read timeout=4)
[12:48:58] onimisionipe: it was a raid controller failure
[12:49:31] that has been on for a while now
[12:49:48] RECOVERY - Host ms-be1034 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms
[12:49:55] all mount points seem ok, however
[12:50:31] Operations, media-storage: ms-be1034 crash - https://phabricator.wikimedia.org/T214838 (jcrespo)
[12:53:20] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 10, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 82, task_max_waiting_in_queue_millis: 499, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 86.4077669903, active_sha
[12:53:20] zing_shards: 4, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[12:54:06] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 82, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 103, in
[12:54:06] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[12:55:31] !log stop, upgrade and reboot db2085
[12:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:26] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 32, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:12:00] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 36, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:37:45] Operations, DBA, Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (jcrespo)
[13:45:19] (PS1) Mathew.onipe: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920)
[13:49:16] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:50:46] (CR) jerkins-bot: [V: -1] elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[13:52:02] !log stop, upgrade and reboot db2092
[13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:26] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 3 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716 (Lea_WMDE)
[14:05:13] (PS2) Mathew.onipe: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920)
[14:06:11] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 3 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716 (Lea_WMDE)
[14:09:49] (PS2) Rxy: Enable CheckUser at Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820)
[14:12:32] (CR) Huji: [C: -1] "See Phab ticket." [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820) (owner: Rxy)
[14:15:42] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[14:16:31] !log stop, upgrade and reboot db2048, this will cause general lag/read only on enwiki/s1-codfw for some minutes
[14:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:46] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (Marostegui) +1 to get rid of it: ` root@dbstore2001:/srv/backups# ls -lhrt | tail -n5 drwx------ 2 dump dump 24K Feb 28 2018 s1.20180228121150 -rw-r--r-- 1 dump dump 86 Feb 28 2018 dump.s3.log drwx------ 2 du...
[15:20:40] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) revision was not affected: ` root@cumin1001:~$ ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki revision rev_id db1067 db1114 --step=100000 [...] 2019-01-28T15:13:27.821882: row id 8799...
[15:22:24] !log restarting jenkins for update
[15:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:22] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (Ankry) the problematic version restored after revi's delete and history merged on-wiki. Is there still any phabricator-related issue here?
[15:31:23] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (jcrespo) > Is there still any phabricator-related issue here? I don't think so- doing as above was my first suggestion, as doing manually is not easy and ca...
[15:49:36] Operations: ms-be1034 icinga alers - https://phabricator.wikimedia.org/T214796 (Marostegui) This might be more likely: T214838
[15:56:57] Operations, media-storage: ms-be1034 crash - https://phabricator.wikimedia.org/T214838 (jcrespo)
[16:12:19] (PS1) Jcrespo: mariadb: Update package to 10.1.37 [software] - https://gerrit.wikimedia.org/r/486871
[16:13:14] (CR) jerkins-bot: [V: -1] mariadb: Update package to 10.1.37 [software] - https://gerrit.wikimedia.org/r/486871 (owner: Jcrespo)
[16:15:20] (CR) Jcrespo: [V: +2 C: +2] mariadb: Update package to 10.1.37 [software] - https://gerrit.wikimedia.org/r/486871 (owner: Jcrespo)
[16:15:51] (PS1) Jcrespo: mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872
[16:16:26] (CR) Jforrester: [C: -2] "Per task, we need to resolve the legal issues before this could be deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820) (owner: Rxy)
[16:16:45] (CR) jerkins-bot: [V: -1] mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872 (owner: Jcrespo)
[16:17:08] (PS2) Jcrespo: mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872
[16:18:03] (CR) jerkins-bot: [V: -1] mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872 (owner: Jcrespo)
[16:19:37] (PS3) Jcrespo: mariadb: Update list of core tables and its primary keys [software] - https://gerrit.wikimedia.org/r/486872
[16:20:36] (CR) jerkins-bot: [V: -1] mariadb: Update list of core tables and its primary keys [software] - https://gerrit.wikimedia.org/r/486872 (owner: Jcrespo)
[16:31:07] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214815 (Zoranzoki21)
[16:31:09] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Zoranzoki21)
[16:36:09] !log remove backups dir at dbstore2001 T214831
[16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:13] T214831: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831
[16:38:22] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo) Open→Resolved a:jcrespo https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=12&fullscreen&orgId=1&var-server=dbstore2001&var-datasource=codfw%20prometheus%2Fops&from=154...
[16:41:27] !log contint1001: cleaning up disk space on / (docker images)
[16:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:12] (PS2) Jcrespo: Revert "mariadb: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/486525
[17:09:45] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.26 seconds
[17:09:45] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.96 seconds
[17:09:53] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.67 seconds
[17:09:55] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.69 seconds
[17:10:03] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.72 seconds
[17:10:17] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.73 seconds
[17:10:33] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.62 seconds
[17:10:51] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.12 seconds
[17:16:51] (CR) Jcrespo: "I checked the core tables, all had no differences." [mediawiki-config] - https://gerrit.wikimedia.org/r/486525 (owner: Jcrespo)
[17:18:21] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) I checked the core tables at https://gerrit.wikimedia.org/r/486872 I propose to repool the host, as persistent replication stats worked well as expected: https://gerrit.wikimedia.or...
[17:46:49] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.53 seconds
[17:53:40] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (Marostegui) Go for it! Thanks for checking it!
[19:49:21] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time
[19:55:42] !log running final pass of requeueTranscodes.php on all wikis to make sure stray missing VP9 transcodes are cleaned up
[19:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:03] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.015 second response time
[20:51:55] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 53.34 seconds
[20:52:37] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:52:43] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 1.05 seconds
[20:52:53] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.36 seconds
[20:52:53] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:52:53] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:53:05] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.27 seconds
[20:53:05] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:54:09] !log Starting wikitext export of content of database for education program on srwiki - T174802 (21:54 UTC+1)
[20:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:13] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[21:02:23] !log Done wikitext export of content of database for education program on srwiki - T174802 (duration: 8 minutes)
[21:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:27] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[21:02:48] quick Q: is PHP7 available on private wikis (ie. stewardwiki, otrswiki)?
[21:05:33] revi, I thought it should be but I'm not seeing it in my prefs
[21:08:21] yeah that's why I'm asking :-p
[21:08:41] I guess I will have to wait for a week if they wanna fix it
[21:08:46] (all hands)
[21:09:11] revi, looks like you can manually set the PHP_ENGINE=php7 cookie and get it that way but I wouldn't advise it
[21:09:20] heh
[21:09:31] yeah I wouldn't do that risky things
[21:09:37] (and I can't do that on mobile!)
[21:10:27] you probably could actually
[21:10:49] again, wouldn't, but either on-wiki user JS to set the cookie, or maybe remote dev tools...
[21:10:59] should work
[21:11:15] the code that handles the MW end of all of this is WikimediaEventsHooks
[21:13:31] not clear why it doesn't show up in beta features though... it should
[21:13:47] will file a task then :-p
[21:13:52] easier to track
[21:15:07] revi: see special:version or check if you can enable it via your prefs.
[21:15:22] I have it enabled for meta and es and do not see any difference
[21:15:28] Hauskatze: can you — on Stewiki?
[21:15:50] revi: geez, log-in and totp just for that? too lazu
[21:15:52] *lazy
[21:16:09] lol
[21:16:16] 'keep me logged in'
[21:16:19] FTW
[21:16:21] never
[21:16:34] I'm just too lazy
[21:16:41] just because it's you I'll check that
[21:16:56] but don't get used to this kind of favours :P
[21:16:58] :-D
[21:18:18] following up in the sekret chan
[21:30:36] T174802... sweet
[21:30:41] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[21:35:13] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 278.45 seconds
[21:57:16] (CR) Vgutierrez: [C: +1] strongswan: Stop supporting trusty [puppet] - https://gerrit.wikimedia.org/r/486442 (owner: Muehlenhoff)
[21:59:16] (PS2) Marostegui: dbstore.my.cnf.erb: Make InnoDB default [puppet] - https://gerrit.wikimedia.org/r/486712 (https://phabricator.wikimedia.org/T213670)
[22:00:04] (CR) Marostegui: [C: +2] dbstore.my.cnf.erb: Make InnoDB default [puppet] - https://gerrit.wikimedia.org/r/486712 (https://phabricator.wikimedia.org/T213670) (owner: Marostegui)
[22:11:55] (PS1) Kosta Harlan: GrowthExperiments: Add help panel link for cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/486992
[23:16:32] Operations, MediaWiki Language Extension Bundle, MediaWiki-Cache, Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (Etonkovidova) It seems that...
[23:17:34] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (Etonkovidova)
[23:18:45] (PS2) Sbisson: GrowthExperiments: Add help panel link for cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/486992 (owner: Kosta Harlan)
[23:24:03] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 56.51 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:25:21] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:37:23] PROBLEM - Host cp2014 is DOWN: PING CRITICAL - Packet loss = 100%
[23:38:24] (PS1) Phuedx: admin: Add alternative key for phuedx [puppet] - https://gerrit.wikimedia.org/r/486995
[23:42:39] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) >>! In T203786#4915156, @Etonkovidova wro...
[23:43:42] taking a look to cp2014...
[23:44:35] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:44:55] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:44:59] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:01] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:01] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:03] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:07] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:19] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:25] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:25] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:29] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:31] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:49] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:55] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:55] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:57] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:46:05] that's cp2014
[23:46:12] only one server affected :)
[23:46:13] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:46:13] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:46:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:50:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:50:49] Operations, ops-codfw, Traffic: cp2014 host down - https://phabricator.wikimedia.org/T214872 (Vgutierrez)
[23:51:22] !log restarting cp2014 - T214872
[23:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:26] T214872: cp2014 host down - https://phabricator.wikimedia.org/T214872
[23:55:33] RECOVERY - Host cp2014 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[23:55:33] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK
[23:55:37] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK
[23:55:37] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 72 ESP OK
[23:55:37] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK
[23:55:51] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK
[23:55:57] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK
[23:55:59] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 72 ESP OK
[23:56:01] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK
[23:56:05] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK
[23:56:09] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 72 ESP OK
[23:56:09] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK
[23:56:09] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 72 ESP OK
[23:56:09] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK
[23:56:11] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK
[23:56:11] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK
[23:56:11] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK
[23:56:13] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK
[23:56:13] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK
[23:56:23] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 72 ESP OK
[23:56:25] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK
[23:56:25] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 72 ESP OK
[23:56:31] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK
[23:56:31] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK
[23:56:43] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 72 ESP OK
[23:56:45] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK
[23:56:45] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK
[23:56:49] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 72 ESP OK
[23:58:55] Any chance I can get a review of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/486995/? I'm on a freshly rebuilt laptop and my current production key is on my home machine
[23:59:45] (CR) Gehel: "minor comments inline." (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)