[00:03:25] RECOVERY - HP RAID on ms-be1034 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK
[00:14:00] (Abandoned) Krinkle: Remove wgTemplateStylesAllowedUrls override (matches default) [mediawiki-config] - https://gerrit.wikimedia.org/r/486829 (owner: Krinkle)
[00:29:29] 3/6
[00:31:09] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:38:16] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[00:38:44] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING WARNING - Packet loss = 93%, RTA = 83.32 ms
[00:39:37] hmm
[00:39:39] uh
[00:39:51] any ops around ^^?
[00:41:35] paladox: no ping loss towards esams from home :P
[00:41:45] me too
[04:05:29] PROBLEM - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0
[04:05:30] ACKNOWLEDGEMENT - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214813
[04:05:35] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (ops-monitoring-bot)
[04:37:18] ACKNOWLEDGEMENT - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 5, Working: 5, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214814
[04:37:23] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214814 (ops-monitoring-bot)
[05:09:16] ACKNOWLEDGEMENT - MD RAID on thumbor2002 is CRITICAL: CRITICAL: State: degraded, Active: 4, Working: 4, Failed: 2, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T214815
[05:09:21] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214815 (ops-monitoring-bot)
[05:32:17] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Peachey88)
[05:32:19] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214814 (Peachey88)
[05:44:44] Operations, serviceops, User-ArielGlenn, User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (Az1568)
[05:58:57] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 227.85 seconds
[06:29:41] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:45:21] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27532 MB (5% inode=99%)
[06:55:29] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[06:55:57] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[07:51:37] Operations, IDS-extension, Wikimedia Taiwan, Wikimedia-Extension-setup, and 2 others: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693 (Shizhao)
[08:45:37] Operations: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (Daniel_Mietchen)
[08:50:33] !next
[08:50:46] jouncebot: next
[08:50:46] In 169 hour(s) and 39 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190204T1030)
[08:54:20] (PS1) Rxy: Enable CheckUser at Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820)
[09:02:10] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1028.37 seconds
[09:20:29] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: fix script naming for run-parts [puppet] - https://gerrit.wikimedia.org/r/486822 (https://phabricator.wikimedia.org/T87001) (owner: BryanDavis)
[09:31:27] rxy: it's all hands this week iirc
[09:31:54] rxy: yeap, no deploys https://wikitech.wikimedia.org/wiki/Deployments#Week_of_January_28th
[09:32:57] Operations, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (Peachey88)
[10:10:46] Operations, Maps (Kartotherian): Create discovery entry for Kartotherian - https://phabricator.wikimedia.org/T214672 (Mathew.onipe) From investigation, I think there's a dns discovery entry for kartotherian. here: https://github.com/wikimedia/operations-dns/blob/master/templates/wmnet#L5189, https://gi...
[10:10:58] Operations, Maps (Kartotherian): Create discovery entry for Kartotherian - https://phabricator.wikimedia.org/T214672 (Mathew.onipe) p:Triage→Normal
[10:35:16] Operations, monitoring: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines - https://phabricator.wikimedia.org/T214762 (jcrespo) @faidon A bit offtopic but related- I filed this because I accepted this patch: https://gerrit.wikimedia.org/r/479741 Should I revert it?
[10:45:41] !log stop, upgrade and reboot db2055
[10:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:57] (Abandoned) Arturo Borrero Gonzalez: decom: delete labtestnet2001.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/451050 (https://phabricator.wikimedia.org/T201440) (owner: Arturo Borrero Gonzalez)
[11:03:22] (Abandoned) Arturo Borrero Gonzalez: decom: delete labtestnet2001.codfw.wmnet [dns] - https://gerrit.wikimedia.org/r/451051 (https://phabricator.wikimedia.org/T201440) (owner: Arturo Borrero Gonzalez)
[11:12:30] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo)
[11:13:46] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo)
[11:15:21] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo) I think we can delete `/srv/backups/`? As those are duplicated from those on es2001 / bacula. Agree?
[11:21:15] !log stop, upgrade and reboot db2062
[11:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:37] Operations, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (jcrespo) @Jrogers-WMF My understanding would be that Grafana dashboards, (for example https://grafana.wikimedia.org/d/000000278/mysql-aggregated ) are factual data (metrics, numbers) a...
[11:33:58] Operations, WMF-Legal, Software-Licensing: Add license statement to Grafana dashboards - https://phabricator.wikimedia.org/T214819 (jcrespo)
[12:31:58] PROBLEM - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100%
[12:37:28] godog: ^
[12:38:12] he won't be around for a while
[12:39:35] It's strange that it just went down
[12:39:45] [8471111.439577] sd 0:1:0:1: rejecting I/O to offline device
[12:39:49] on logs
[12:40:00] jynus: is there anybody we can call for this
[12:40:08] you can call me :-)
[12:40:14] :)
[12:40:17] ok
[12:40:37] swift is supposed to be reliable against a single-machine failure
[12:40:40] so not too worried
[12:40:47] I guess so too
[12:41:02] can we even powercycle it?
[12:41:07] I think 1034 had hw issues in the past
[12:41:07] maybe via mgmt?
[12:41:11] yes we can
[12:41:13] I am on it
[12:41:17] but I am trying to avoid it
[12:41:18] cool
[12:42:03] godog had to powercycle ms-be1020 yesterday as it caused ms-be1034's load to keep going up
[12:42:49] !log restarting all elasticsearch instances on relforge1002 to test spicerack command
[12:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:39] I guess there is no other way- but I am not even sure if it will come back later
[12:45:55] !log powercycle ms-be1034
[12:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:44] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.37.21:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.21, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f7519796510: Failed to establish a new connection: [Errno 111] Connec
[12:47:06] ^that's me. It's expected
[12:47:38] PROBLEM - ElasticSearch health check for shards on 9200 on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Read timed out. (read timeout=4)
[12:48:58] onimisionipe: it was a raid controller failure
[12:49:31] that has been on for a while now
[12:49:48] RECOVERY - Host ms-be1034 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms
[12:49:55] all mount points seem ok, however
[12:50:31] Operations, media-storage: ms-be1034 crash - https://phabricator.wikimedia.org/T214838 (jcrespo)
[12:53:20] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad: status: yellow, number_of_nodes: 2, unassigned_shards: 10, number_of_pending_tasks: 6, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 82, task_max_waiting_in_queue_millis: 499, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 86.4077669903, active_sha
[12:53:20] zing_shards: 4, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[12:54:06] RECOVERY - ElasticSearch health check for shards on 9200 on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 82, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active_shards: 103, in
[12:54:06] : 0, number_of_data_nodes: 2, delayed_unassigned_shards: 0
[12:55:31] !log stop, upgrade and reboot db2085
[12:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:26] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 32, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:12:00] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 36, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:37:45] Operations, DBA, Packaging: db2085 doesn't boot with 4.9.0-8-amd64 - https://phabricator.wikimedia.org/T214840 (jcrespo)
[13:45:19] (PS1) Mathew.onipe: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920)
[13:49:16] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:50:46] (CR) jerkins-bot: [V: -1] elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)
[13:52:02] !log stop, upgrade and reboot db2092
[13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:26] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 3 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716 (Lea_WMDE)
[14:05:13] (PS2) Mathew.onipe: elasticsearch_cluster: fix issues from test result [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920)
[14:06:11] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 3 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716 (Lea_WMDE)
[14:09:49] (PS2) Rxy: Enable CheckUser at Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820)
[14:12:32] (CR) Huji: [C: -1] "See Phab ticket." [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820) (owner: Rxy)
[14:15:42] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[14:16:31] !log stop, upgrade and reboot db2048, this will cause general lag/read only on enwiki/s1-codfw for some minutes
[14:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:46] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (Marostegui) +1 to get rid of it: ` root@dbstore2001:/srv/backups# ls -lhrt | tail -n5 drwx------ 2 dump dump 24K Feb 28 2018 s1.20180228121150 -rw-r--r-- 1 dump dump 86 Feb 28 2018 dump.s3.log drwx------ 2 du...
[15:20:40] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) revision was not affected: ` root@cumin1001:~$ ./wmfmariadbpy/wmfmariadbpy/compare.py enwiki revision rev_id db1067 db1114 --step=100000 [...] 2019-01-28T15:13:27.821882: row id 8799...
[15:22:24] !log restarting jenkins for update
[15:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:22] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (Ankry) the problematic version restored after revi's delete and history merged on-wiki. Is there still any phabricator-related issue here?
[15:31:23] Operations, MediaWiki-General-or-Unknown, Multimedia, media-storage: Lost file Juan_Guaidó.jpg - https://phabricator.wikimedia.org/T213655 (jcrespo) > Is there still any phabricator-related issue here? I don't think so- doing as above was my first suggestion, as doing manually is not easy and ca...
[15:49:36] Operations: ms-be1034 icinga alers - https://phabricator.wikimedia.org/T214796 (Marostegui) This might be more likely: T214838
[15:56:57] Operations, media-storage: ms-be1034 crash - https://phabricator.wikimedia.org/T214838 (jcrespo)
[16:12:19] (PS1) Jcrespo: mariadb: Update package to 10.1.37 [software] - https://gerrit.wikimedia.org/r/486871
[16:13:14] (CR) jerkins-bot: [V: -1] mariadb: Update package to 10.1.37 [software] - https://gerrit.wikimedia.org/r/486871 (owner: Jcrespo)
[16:15:20] (CR) Jcrespo: [V: +2 C: +2] mariadb: Update package to 10.1.37 [software] - https://gerrit.wikimedia.org/r/486871 (owner: Jcrespo)
[16:15:51] (PS1) Jcrespo: mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872
[16:16:26] (CR) Jforrester: [C: -2] "Per task, we need to resolve the legal issues before this could be deployed." [mediawiki-config] - https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820) (owner: Rxy)
[16:16:45] (CR) jerkins-bot: [V: -1] mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872 (owner: Jcrespo)
[16:17:08] (PS2) Jcrespo: mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872
[16:18:03] (CR) jerkins-bot: [V: -1] mariadb: Add actor to the core list of tables to check [software] - https://gerrit.wikimedia.org/r/486872 (owner: Jcrespo)
[16:19:37] (PS3) Jcrespo: mariadb: Update list of core tables and its primary keys [software] - https://gerrit.wikimedia.org/r/486872
[16:20:36] (CR) jerkins-bot: [V: -1] mariadb: Update list of core tables and its primary keys [software] - https://gerrit.wikimedia.org/r/486872 (owner: Jcrespo)
[16:31:07] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214815 (Zoranzoki21)
[16:31:09] Operations, ops-codfw: Degraded RAID on thumbor2002 - https://phabricator.wikimedia.org/T214813 (Zoranzoki21)
[16:36:09] !log remove backups dir at dbstore2001 T214831
[16:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:36:13] T214831: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831
[16:38:22] Operations, DBA: dbstore2001 low on disk space - https://phabricator.wikimedia.org/T214831 (jcrespo) Open→Resolved a:jcrespo https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=12&fullscreen&orgId=1&var-server=dbstore2001&var-datasource=codfw%20prometheus%2Fops&from=154...
[16:41:27] !log contint1001: cleaning up disk space on / (docker images)
[16:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:12] (PS2) Jcrespo: Revert "mariadb: Depool db1114" [mediawiki-config] - https://gerrit.wikimedia.org/r/486525
[17:09:45] PROBLEM - MariaDB Slave Lag: s4 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.26 seconds
[17:09:45] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.96 seconds
[17:09:53] PROBLEM - MariaDB Slave Lag: s4 on db2051 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.67 seconds
[17:09:55] PROBLEM - MariaDB Slave Lag: s4 on db2073 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.69 seconds
[17:10:03] PROBLEM - MariaDB Slave Lag: s4 on db2091 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.72 seconds
[17:10:17] PROBLEM - MariaDB Slave Lag: s4 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.73 seconds
[17:10:33] PROBLEM - MariaDB Slave Lag: s4 on db2090 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.62 seconds
[17:10:51] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.12 seconds
[17:16:51] (CR) Jcrespo: "I checked the core tables, all had no differences." [mediawiki-config] - https://gerrit.wikimedia.org/r/486525 (owner: Jcrespo)
[17:18:21] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (jcrespo) I checked the core tables at https://gerrit.wikimedia.org/r/486872 I propose to repool the host, as persistent replication stats worked well as expected: https://gerrit.wikimedia.or...
[17:46:49] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.53 seconds
[17:53:40] Operations, ops-eqiad, DBA, Patch-For-Review: db1114 crashed - https://phabricator.wikimedia.org/T214720 (Marostegui) Go for it! Thanks for checking it!
[19:49:21] PROBLEM - graphite-labs.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.003 second response time
[19:55:42] !log running final pass of requeueTranscodes.php on all wikis to make sure stray missing VP9 transcodes are cleaned up
[19:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:03] RECOVERY - graphite-labs.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.015 second response time
[20:51:55] RECOVERY - MariaDB Slave Lag: s4 on db2091 is OK: OK slave_sql_lag Replication lag: 53.34 seconds
[20:52:37] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:52:43] RECOVERY - MariaDB Slave Lag: s4 on db2095 is OK: OK slave_sql_lag Replication lag: 1.05 seconds
[20:52:53] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.36 seconds
[20:52:53] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:52:53] RECOVERY - MariaDB Slave Lag: s4 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:53:05] RECOVERY - MariaDB Slave Lag: s4 on db2073 is OK: OK slave_sql_lag Replication lag: 0.27 seconds
[20:53:05] RECOVERY - MariaDB Slave Lag: s4 on db2051 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[20:54:09] !log Starting wikitext export of content of database for education program on srwiki - T174802 (21:54 UTC+1)
[20:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:13] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[21:02:23] !log Done wikitext export of content of database for education program on srwiki - T174802 (duration: 8 minutes)
[21:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:27] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[21:02:48] quick Q: is PHP7 available on private wikis (ie. stewardwiki, otrswiki)?
[21:05:33] revi, I thought it should be but I'm not seeing it in my prefs
[21:08:21] yeah that's why I'm asking :-p
[21:08:41] I guess I will have to wait for a week if they wanna fix it
[21:08:46] (all hands)
[21:09:11] revi, looks like you can manually set the PHP_ENGINE=php7 cookie and get it that way but I wouldn't advise it
[21:09:20] heh
[21:09:31] yeah I wouldn't do that risky things
[21:09:37] (and I can't do that on mobile!)
[21:10:27] you probably could actually
[21:10:49] again, wouldn't, but either on-wiki user JS to set the cookie, or maybe remote dev tools...
[21:10:59] should work
[21:11:15] the code that handles the MW end of all of this is WikimediaEventsHooks
[21:13:31] not clear why it doesn't show up in beta features though... it should
[21:13:47] will file a task then :-p
[21:13:52] easier to track
[21:15:07] revi: see special:version or check if you can enable it via your prefs.
[21:15:22] I have it enabled for meta and es and do not see any difference
[21:15:28] Hauskatze: can you — on Stewiki?
[21:15:50] revi: geez, log-in and totp just for that? too lazu
[21:15:52] *lazy
[21:16:09] lol
[21:16:16] 'keep me logged in'
[21:16:19] FTW
[21:16:21] never
[21:16:34] I'm just too lazy
[21:16:41] just because it's you I'll check that
[21:16:56] but don't get used to this kind of favours :P
[21:16:58] :-D
[21:18:18] following up in the sekret chan
[21:30:36] T174802... sweet
[21:30:41] T174802: Archive and drop education program (ep_*) tables on all wikis - https://phabricator.wikimedia.org/T174802
[21:35:13] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 278.45 seconds
[21:57:16] (CR) Vgutierrez: [C: +1] strongswan: Stop supporting trusty [puppet] - https://gerrit.wikimedia.org/r/486442 (owner: Muehlenhoff)
[21:59:16] (PS2) Marostegui: dbstore.my.cnf.erb: Make InnoDB default [puppet] - https://gerrit.wikimedia.org/r/486712 (https://phabricator.wikimedia.org/T213670)
[22:00:04] (CR) Marostegui: [C: +2] dbstore.my.cnf.erb: Make InnoDB default [puppet] - https://gerrit.wikimedia.org/r/486712 (https://phabricator.wikimedia.org/T213670) (owner: Marostegui)
[22:11:55] (PS1) Kosta Harlan: GrowthExperiments: Add help panel link for cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/486992
[23:16:32] Operations, MediaWiki Language Extension Bundle, MediaWiki-Cache, Language-Team (Language-2019-January-March), and 5 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (Etonkovidova) It seems that...
[23:17:34] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (Etonkovidova)
[23:18:45] (PS2) Sbisson: GrowthExperiments: Add help panel link for cawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/486992 (owner: Kosta Harlan)
[23:24:03] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 56.51 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:25:21] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:37:23] PROBLEM - Host cp2014 is DOWN: PING CRITICAL - Packet loss = 100%
[23:38:24] (PS1) Phuedx: admin: Add alternative key for phuedx [puppet] - https://gerrit.wikimedia.org/r/486995
[23:42:39] Operations, MediaWiki-Cache, MW-1.33-notes (1.33.0-wmf.13; 2019-01-15), Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) >>! In T203786#4915156, @Etonkovidova wro...
[23:43:42] taking a look to cp2014...
[23:44:35] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:44:55] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:44:59] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:01] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:01] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:03] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:07] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:11] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:19] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:25] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:25] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:29] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:31] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:35] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:39] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:49] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:55] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:45:55] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp2014_v4, cp2014_v6
[23:45:57] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:46:05] that's cp2014
[23:46:12] only one server affected :)
[23:46:13] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:46:13] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 38 connecting: cp2014_v4, cp2014_v6
[23:46:45] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:50:05] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:50:49] Operations, ops-codfw, Traffic: cp2014 host down - https://phabricator.wikimedia.org/T214872 (Vgutierrez)
[23:51:22] !log restarting cp2014 - T214872
[23:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:26] T214872: cp2014 host down - https://phabricator.wikimedia.org/T214872
[23:55:33] RECOVERY - Host cp2014 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[23:55:33] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 40 ESP OK
[23:55:37] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 40 ESP OK
[23:55:37] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 72 ESP OK
[23:55:37] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 40 ESP OK
[23:55:45] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 40 ESP OK
[23:55:51] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 40 ESP OK
[23:55:57] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 40 ESP OK
[23:55:59] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 72 ESP OK
[23:56:01] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 40 ESP OK
[23:56:05] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 40 ESP OK
[23:56:09] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 72 ESP OK
[23:56:09] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 40 ESP OK
[23:56:09] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 72 ESP OK
[23:56:09] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 40 ESP OK
[23:56:11] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 40 ESP OK
[23:56:11] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 40 ESP OK
[23:56:11] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 40 ESP OK
[23:56:13] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 40 ESP OK
[23:56:13] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 40 ESP OK
[23:56:23] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 72 ESP OK
[23:56:25] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 40 ESP OK
[23:56:25] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 72 ESP OK
[23:56:31] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 40 ESP OK
[23:56:31] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 40 ESP OK
[23:56:43] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 72 ESP OK
[23:56:45] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 40 ESP OK
[23:56:45] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 40 ESP OK
[23:56:49] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 72 ESP OK
[23:58:55] Any chance I can get a review of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/486995/? I'm on a freshly rebuilt laptop and my current production key is on my home machine
[23:59:45] (CR) Gehel: "minor comments inline." (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/486858 (https://phabricator.wikimedia.org/T207920) (owner: Mathew.onipe)