[00:00:06] (PS4) Awight: Tolerate empty tables [dumps] - https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116)
[00:00:16] (CR) Awight: "I don't have +2 here, FWIW" [dumps] - https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116) (owner: Awight)
[00:02:51] Operations, Puppet, Beta-Cluster-Infrastructure, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (Krinkle)
[00:03:20] Operations, Puppet, Beta-Cluster-Infrastructure, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (Krinkle)
[00:05:02] Operations, Wikimedia-Apache-configuration, Patch-For-Review, User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (Krinkle)
[00:05:06] Operations, Wikimedia-Apache-configuration, Patch-For-Review, User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (Krinkle)
[00:05:08] Operations, Puppet, Beta-Cluster-Infrastructure, Technical-Debt, Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (Krinkle)
[00:06:02] Operations, Wikimedia-Apache-configuration, Patch-For-Review, User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (Krinkle)
[00:17:49] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0
[00:17:50] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0
[00:17:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up:
122, down: 1, dormant: 0, excluded: 0, unused: 0
[00:31:09] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0
[00:36:09] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:37:30] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0
[00:37:39] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[00:39:20] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:47:00] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:52:39] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:14:39] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54525 MB (3% inode=99%)
[01:22:09] RECOVERY - Disk space on maps1001 is OK: DISK OK
[02:02:10] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:05:30] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:24:59] PROBLEM - SSH on ms-be1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:26:19] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:27:30] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:28:00] RECOVERY - SSH on ms-be1036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[02:31:49] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[03:26:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 803.26 seconds
[03:59:40] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 10 failures. Last run 3 minutes ago with 10 failures. Failed resources (up to 3 shown): Exec[create_user-prometheus@localhost],Exec[create_user-replication@labsdb1006.eqiad.wmnet-v4],Exec[create_user-osm@labs],Exec[create_user-kolossos@labs]
[04:16:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 153.57 seconds
[04:19:39] (PS2) Krinkle: grafana: Remove varnish-http-errors dashboard [puppet] - https://gerrit.wikimedia.org/r/445336
[04:44:20] PROBLEM - SSH on ms-be1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:45:09] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:45:20] RECOVERY - SSH on ms-be1036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[04:54:50] Operations, Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (Marostegui) I have seen this error (one in the last 8 hours): ``` cli_argv /srv/mediawiki/multiversion/MWScript.php maintenance/cleanupUploadStash.php -...
[04:57:09] (PS1) Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/445563
[04:58:54] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/445563 (owner: Marostegui)
[04:59:09] Operations, Core-Platform-Team, PoolCounter, monitoring: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (Krinkle)
[04:59:34] Operations, Core-Platform-Team, PoolCounter, monitoring: High levels of PoolCounter errors should trigger alerts - https://phabricator.wikimedia.org/T133318 (Krinkle)
[05:00:31] (Merged) jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/445563 (owner: Marostegui)
[05:00:45] (CR) jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/445563 (owner: Marostegui)
[05:01:07] Operations, Discovery, Discovery-Search, Elasticsearch: Collect metrics on CirrusSearch usage of PoolCounter - https://phabricator.wikimedia.org/T130617 (Krinkle)
[05:01:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1110 for alter table (duration: 00m 52s)
[05:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:53] Operations, ops-eqiad, PoolCounter, decommission: Decommision poolcounter1002 - https://phabricator.wikimedia.org/T193025 (Krinkle)
[05:06:35] !log Rename wbc_entity_usage.eu_touched column on db1110 - T144010
[05:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:38] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[05:08:29] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/445564
[05:10:29] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/445564
(owner: Marostegui)
[05:12:03] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/445564 (owner: Marostegui)
[05:12:18] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/445564 (owner: Marostegui)
[05:13:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1110 after alter table (duration: 00m 50s)
[05:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:57] (PS1) Marostegui: db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/445565
[05:18:02] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/445565 (owner: Marostegui)
[05:19:43] (Merged) jenkins-bot: db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/445565 (owner: Marostegui)
[05:20:55] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2068 for grants testing (duration: 00m 50s)
[05:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:19] (CR) jenkins-bot: db-codfw.php: Depool db2068 [mediawiki-config] - https://gerrit.wikimedia.org/r/445565 (owner: Marostegui)
[05:21:37] !log Stop MySQL on db2068 for upgrade and check unused grants
[05:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:37] (PS1) Marostegui: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/445568
[05:48:29] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/445568 (owner: Marostegui)
[05:48:59] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 31 failures. Last run 2 minutes ago with 31 failures.
Failed resources (up to 3 shown): Exec[chown /srv/deployment/changeprop for deploy-service],Package[eventstreams/deploy],Exec[chown /srv/deployment/eventstreams for deploy-service],Service[pdfrender]
[05:50:09] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/445568 (owner: Marostegui)
[05:50:21] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2068" [mediawiki-config] - https://gerrit.wikimedia.org/r/445568 (owner: Marostegui)
[05:51:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2068 after maintenance (duration: 00m 49s)
[05:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:59] !log stop and reimage db1087
[06:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:19] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 35 failures. Last run 6 minutes ago with 35 failures. Failed resources (up to 3 shown): Exec[chown /srv/deployment/mathoid for deploy-service],Package[graphoid/deploy],Exec[chown /srv/deployment/graphoid for deploy-service],Package[citoid/deploy]
[06:04:24] we will create temporary lag on wikireplicas-s8
[06:05:10] scb2002 issues was temporary
[06:06:20] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:07:41] (PS1) Jcrespo: Revert "mariadb: Depool db1087 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/445569
[06:10:30] PROBLEM - SSH on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:10:30] PROBLEM - configured eth on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:10:30] PROBLEM - apertium apy on scb2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:10:39] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:10:49] PROBLEM - cxserver endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:10:59] PROBLEM - Disk space on scb2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:10:59] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[06:11:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve featured article info fo
[06:11:00] (with aggregated=true)) timed out before a response was received
[06:11:09] PROBLEM - eventstreams on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 8092: Connection refused
[06:11:29] RECOVERY - SSH on scb2002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[06:11:30] RECOVERY - apertium apy on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.075 second response time
[06:11:30] RECOVERY - configured eth on scb2002 is OK: OK - interfaces up
[06:11:39] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[06:11:49] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[06:11:59] RECOVERY - cxserver endpoints health on scb2002 is OK: All endpoints are healthy
[06:11:59] RECOVERY - Disk space on scb2002 is OK: DISK OK
[06:12:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[06:12:19] RECOVERY - eventstreams on
scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.092 second response time
[06:12:19] could there be connectivity issues on scb2002?
[06:13:29] no, there are os issues
[06:19:20] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:20:29] (PS3) Elukey: role::kafka::main: raise Kafka Java Xmx/Xms [puppet] - https://gerrit.wikimedia.org/r/445304
[06:21:39] (PS2) Nuria: Changing dimensions to be read as numbers [puppet] - https://gerrit.wikimedia.org/r/445553 (https://phabricator.wikimedia.org/T167494)
[06:22:18] (CR) jerkins-bot: [V: -1] Changing dimensions to be read as numbers [puppet] - https://gerrit.wikimedia.org/r/445553 (https://phabricator.wikimedia.org/T167494) (owner: Nuria)
[06:25:27] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11786/" [puppet] - https://gerrit.wikimedia.org/r/445304 (owner: Elukey)
[06:32:14] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_ipmi_sensor]
[06:36:10] (PS1) Jcrespo: mariadb: Allow reimage of db1078 & db1079 to stretch [puppet] - https://gerrit.wikimedia.org/r/445570
[06:37:23] RECOVERY - DPKG on analytics1072 is OK: All packages OK
[06:38:33] RECOVERY - puppet last run on analytics1072 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:41:23] RECOVERY - DPKG on analytics1075 is OK: All packages OK
[06:42:13] !log unblocked stuck dpkg processes on an107[2,5] that broke puppet
[06:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:24] RECOVERY - puppet last run on analytics1075 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:47:54] (CR) Mobrovac: [C: 1] "Nice!"
[puppet] - https://gerrit.wikimedia.org/r/445304 (owner: Elukey)
[06:50:20] !log powercycle ms-be1041 after diagnostic tests
[06:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:12] (Abandoned) Jonas Kress (WMDE): Add monthly storage schema for graphite [puppet] - https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) (owner: Jonas Kress (WMDE))
[06:57:43] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:28] (CR) Elukey: [C: 2] role::kafka::main: raise Kafka Java Xmx/Xms [puppet] - https://gerrit.wikimedia.org/r/445304 (owner: Elukey)
[07:00:34] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 48.12, 29.64, 17.30
[07:05:02] (PS3) Arturo Borrero Gonzalez: mariadb: Add prometheus monitoring to labcontrol1003 [puppet] - https://gerrit.wikimedia.org/r/445433 (owner: Jcrespo)
[07:05:37] (PS2) Muehlenhoff: Switch noc backend for codfw from wasat to mwmaint2001 [puppet] - https://gerrit.wikimedia.org/r/445438
[07:09:03] (CR) Arturo Borrero Gonzalez: [C: 2] "The compiler is happy:" [puppet] - https://gerrit.wikimedia.org/r/445433 (owner: Jcrespo)
[07:10:44] (PS3) Muehlenhoff: Switch noc backend for codfw from wasat to mwmaint2001 [puppet] - https://gerrit.wikimedia.org/r/445438
[07:11:24] (CR) Muehlenhoff: [C: 2] Switch noc backend for codfw from wasat to mwmaint2001 [puppet] - https://gerrit.wikimedia.org/r/445438 (owner: Muehlenhoff)
[07:13:43] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 13.12, 24.64, 23.88
[07:24:03] PROBLEM - Check systemd state on labcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:36:41] Operations, monitoring: Alert on negative disk space available - https://phabricator.wikimedia.org/T199436 (MoritzMuehlenhoff) p:Triage>Normal
[07:46:14] (PS2) Jcrespo: mariadb: Allow reimage of db1078 & db1079 to stretch [puppet] - https://gerrit.wikimedia.org/r/445570
[07:46:32] (CR) Jcrespo: [C: 2] Revert "mariadb: Depool db1087 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/445569 (owner: Jcrespo)
[07:47:01] (CR) Jcrespo: [C: 2] mariadb: Allow reimage of db1078 & db1079 to stretch [puppet] - https://gerrit.wikimedia.org/r/445570 (owner: Jcrespo)
[07:47:49] (CR) jenkins-bot: Revert "mariadb: Depool db1087 for reimage" [mediawiki-config] - https://gerrit.wikimedia.org/r/445569 (owner: Jcrespo)
[07:50:02] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 51s)
[07:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:13] RECOVERY - Check systemd state on labcontrol1004 is OK: OK - running: The system is fully operational
[07:55:39] (PS3) Elukey: Changing dimensions to be read as numbers [puppet] - https://gerrit.wikimedia.org/r/445553 (https://phabricator.wikimedia.org/T167494) (owner: Nuria)
[07:55:47] (PS4) Elukey: Changing dimensions to be read as numbers [puppet] - https://gerrit.wikimedia.org/r/445553 (https://phabricator.wikimedia.org/T167494) (owner: Nuria)
[08:02:34] (PS1) Muehlenhoff: Enable microcode on restbase servers [puppet] - https://gerrit.wikimedia.org/r/445573 (https://phabricator.wikimedia.org/T127825)
[08:13:16] (CR) Elukey: [C: 2] "Tested manually on thorium with Joseph, works!"
[puppet] - https://gerrit.wikimedia.org/r/445553 (https://phabricator.wikimedia.org/T167494) (owner: Nuria)
[08:20:28] (PS3) Giuseppe Lavagetto: mediawiki_test: split wikimania.conf [puppet] - https://gerrit.wikimedia.org/r/444240
[08:21:27] (CR) Giuseppe Lavagetto: [C: 2] mediawiki_test: split wikimania.conf [puppet] - https://gerrit.wikimedia.org/r/444240 (owner: Giuseppe Lavagetto)
[08:22:33] (PS1) Jcrespo: mariadb: Depool db1078 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/445575
[08:25:14] (CR) Jcrespo: [C: 2] mariadb: Depool db1078 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/445575 (owner: Jcrespo)
[08:26:52] (Merged) jenkins-bot: mariadb: Depool db1078 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/445575 (owner: Jcrespo)
[08:27:46] (CR) jenkins-bot: mariadb: Depool db1078 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/445575 (owner: Jcrespo)
[08:30:53] Operations, media-storage, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (MoritzMuehlenhoff) I reviewed the kernel logs on the Swift servers we picked as canaries to test microcode updates and I noticed that we also various kernel er...
[08:34:27] (PS1) Muehlenhoff: Enable microcode for Swift backend servers [puppet] - https://gerrit.wikimedia.org/r/445576 (https://phabricator.wikimedia.org/T127825)
[08:40:06] Operations, DBA, MediaWiki-Database: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (jcrespo)
[08:42:55] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, User-Joe: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (jcrespo)
[08:43:35] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, User-Joe: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (jcrespo)
[08:44:02] Operations, DBA, MediaWiki-Configuration, Patch-For-Review, User-Joe: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (jcrespo)
[08:47:43] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 (duration: 00m 50s)
[08:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:56] (CR) Filippo Giunchedi: [C: 1] Enable microcode for Swift backend servers [puppet] - https://gerrit.wikimedia.org/r/445576 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[08:49:08] (CR) Filippo Giunchedi: [C: 1] Enable microcode on restbase servers [puppet] - https://gerrit.wikimedia.org/r/445573 (https://phabricator.wikimedia.org/T127825) (owner: Muehlenhoff)
[08:50:42] (CR) Filippo Giunchedi: "LGTM, see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/445336 (owner: Krinkle)
[08:52:14] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational
[08:54:13] that was me, fixing the stuck sessions
[08:58:57] (PS3)
Muehlenhoff: Decommission terbium [puppet] - https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092)
[08:59:20] (PS3) Muehlenhoff: Update grants for terbium->mwmaint1001 migration and wasat rename [puppet] - https://gerrit.wikimedia.org/r/445421 (https://phabricator.wikimedia.org/T192092)
[08:59:23] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:00:25] (CR) Muehlenhoff: [C: 2] Update grants for terbium->mwmaint1001 migration and wasat rename [puppet] - https://gerrit.wikimedia.org/r/445421 (https://phabricator.wikimedia.org/T192092) (owner: Muehlenhoff)
[09:05:07] !log apply updated grants to m5 hosts
[09:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:18] Operations, Wikimedia-Logstash, Goal: Logstash/Kibana architecture review - https://phabricator.wikimedia.org/T198754 (fgiunchedi) Non exhaustive list of things that we'll need to address: * More insight into logstash/kibana activity via prometheus metrics (elasticsearch already has prometheus metric...
[09:13:04] Operations, Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (jcrespo) There is an undocumented grant from californium.wikimedia.org to striker @bd808 - I will delete it if it is not puppetized it. I will create a separate...
[09:20:15] Operations, Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (MoritzMuehlenhoff) Let's wait for confirmation by Bryan, but californium is up for decom (replaced by the labweb* hosts), so 99.9% sure this can go away.
[09:21:33] Operations, Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (jcrespo) I have created T199518.
[09:21:56] (PS3) Giuseppe Lavagetto: mediawiki_test: complete the transition to one wiki per template. [puppet] - https://gerrit.wikimedia.org/r/444241
[09:23:36] (PS1) Muehlenhoff: Remove conditionals for older distros in mediawiki_maintenance profile [puppet] - https://gerrit.wikimedia.org/r/445580
[09:25:14] Operations, Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (jcrespo) No more grants on m5 referencing 10.64.32.13 (terbium): ``` $ ./software/dbtools/section m5 | while read host port; do mysql.py -BN -h$host:$port -e "s...
[09:27:07] (CR) Giuseppe Lavagetto: [C: 2] mediawiki_test: complete the transition to one wiki per template. [puppet] - https://gerrit.wikimedia.org/r/444241 (owner: Giuseppe Lavagetto)
[09:27:46] !log stop and reimage db1078
[09:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:34] (PS1) Arturo Borrero Gonzalez: openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633)
[09:30:10] (CR) jerkins-bot: [V: -1] openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:31:40] (CR) Vgutierrez: [C: 1] Cleanup NaiveBGPPeeringTestCase [debs/pybal] - https://gerrit.wikimedia.org/r/436769 (owner: Mark Bergsma)
[09:33:05] (PS1) Giuseppe Lavagetto: mediawiki_test: fix order of inclusion of virtualhosts in wikimedia.conf [puppet] - https://gerrit.wikimedia.org/r/445583
[09:33:25] !log installing ruby-sprockets security updates on jessie
[09:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:25] (CR) Giuseppe Lavagetto: [C: 2] mediawiki_test: fix order of inclusion of virtualhosts in wikimedia.conf [puppet] -
https://gerrit.wikimedia.org/r/445583 (owner: Giuseppe Lavagetto)
[09:35:13] (PS2) Arturo Borrero Gonzalez: openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633)
[09:35:56] !log aaron@deploy1001 sync-dir aborted: (no justification provided) (duration: 00m 01s)
[09:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:53] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.12/includes/libs/objectcache: 5efa9f67ed5230b41fef4bb504a7c46939f8b4c6 (duration: 00m 51s)
[09:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:18] (PS3) Gehel: Enable fetching constraints for Updater [puppet] - https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567) (owner: Smalyshev)
[09:39:48] (PS3) Arturo Borrero Gonzalez: openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633)
[09:41:04] (PS4) Arturo Borrero Gonzalez: openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633)
[09:45:17] !log installing libpng security updates on trusty (Debian already updated)
[09:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:47] (PS5) Arturo Borrero Gonzalez: openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633)
[09:53:16] (PS1) Muehlenhoff: Add library hints for libpng [puppet] - https://gerrit.wikimedia.org/r/445584
[09:54:11] (PS2) Muehlenhoff: Add library hints for libpng [puppet] - https://gerrit.wikimedia.org/r/445584
[09:55:02] (CR) Muehlenhoff: [C: 2] Add library hints for libpng [puppet] - https://gerrit.wikimedia.org/r/445584 (owner:
Muehlenhoff)
[09:57:11] (CR) Arturo Borrero Gonzalez: [C: 2] "Compiler is happy:" [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[09:57:20] (PS6) Arturo Borrero Gonzalez: openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633)
[09:57:55] (CR) Arturo Borrero Gonzalez: [V: 2 C: 2] openstack: neutron: make agents heartbeats configurable [puppet] - https://gerrit.wikimedia.org/r/445582 (https://phabricator.wikimedia.org/T196633) (owner: Arturo Borrero Gonzalez)
[10:04:02] (CR) Giuseppe Lavagetto: rake: add ability to check syntax of dhcp files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/441537 (owner: Giuseppe Lavagetto)
[10:04:32] (PS5) Giuseppe Lavagetto: rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537
[10:05:51] (CR) Giuseppe Lavagetto: [C: 2] rake: add ability to check syntax of dhcp files [puppet] - https://gerrit.wikimedia.org/r/441537 (owner: Giuseppe Lavagetto)
[10:12:58] !log installing cups/libcups security updates on stretch/trusty
[10:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:26] (PS1) Ladsgroup: labs: Set $wgTagStatisticsNewTable to true [mediawiki-config] - https://gerrit.wikimedia.org/r/445586 (https://phabricator.wikimedia.org/T199334)
[10:15:24] (CR) Alexandros Kosiaris: [C: 2] grafana: use host-overview in favour of server-board for featured dashboard [puppet] - https://gerrit.wikimedia.org/r/444219 (https://phabricator.wikimedia.org/T178690) (owner: Filippo Giunchedi)
[10:15:31] (PS2) Alexandros Kosiaris: grafana: use host-overview in favour of server-board for featured dashboard [puppet] - https://gerrit.wikimedia.org/r/444219 (https://phabricator.wikimedia.org/T178690) (owner: Filippo Giunchedi)
[10:16:28] (PS1) Arturo Borrero Gonzalez: cloudvps: reimage+rename labnet1003 as cloudnet1003 [puppet] - https://gerrit.wikimedia.org/r/445587 (https://phabricator.wikimedia.org/T199521)
[10:18:10] (CR) Ladsgroup: [C: 2] labs: Set $wgTagStatisticsNewTable to true [mediawiki-config] - https://gerrit.wikimedia.org/r/445586 (https://phabricator.wikimedia.org/T199334) (owner: Ladsgroup)
[10:20:02] (Abandoned) Giuseppe Lavagetto: Revert "mediawiki_test: convert all of main.conf to individual sites" [puppet] - https://gerrit.wikimedia.org/r/444214 (owner: Giuseppe Lavagetto)
[10:20:09] (Merged) jenkins-bot: labs: Set $wgTagStatisticsNewTable to true [mediawiki-config] - https://gerrit.wikimedia.org/r/445586 (https://phabricator.wikimedia.org/T199334) (owner: Ladsgroup)
[10:20:22] (CR) jenkins-bot: labs: Set $wgTagStatisticsNewTable to true [mediawiki-config] - https://gerrit.wikimedia.org/r/445586 (https://phabricator.wikimedia.org/T199334) (owner: Ladsgroup)
[10:20:56] (PS1) Arturo Borrero Gonzalez: cloudvps: reimage+rename labnet1003 as cloudnet1003 [dns] - https://gerrit.wikimedia.org/r/445589 (https://phabricator.wikimedia.org/T199521)
[10:21:20] (CR) ArielGlenn: "Let's get rid of references to it here too: modules/role/templates/mariadb/grants/tendril.sql.erb (by IP)."
[puppet] - https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092) (owner: Muehlenhoff)
[10:24:15] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: reimage+rename labnet1003 as cloudnet1003 [puppet] - https://gerrit.wikimedia.org/r/445587 (https://phabricator.wikimedia.org/T199521) (owner: Arturo Borrero Gonzalez)
[10:24:23] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: reimage+rename labnet1003 as cloudnet1003 [dns] - https://gerrit.wikimedia.org/r/445589 (https://phabricator.wikimedia.org/T199521) (owner: Arturo Borrero Gonzalez)
[10:24:25] (PS2) Arturo Borrero Gonzalez: cloudvps: reimage+rename labnet1003 as cloudnet1003 [puppet] - https://gerrit.wikimedia.org/r/445587 (https://phabricator.wikimedia.org/T199521)
[10:29:06] (CR) Muehlenhoff: "Good catch, I'll split this off to a separate patch, though." [puppet] - https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092) (owner: Muehlenhoff)
[10:31:30] (PS1) Muehlenhoff: Remove terbium for tendril grants [puppet] - https://gerrit.wikimedia.org/r/445590
[10:33:58] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[10:34:06] (CR) ArielGlenn: [C: 1] "In that case, here's my +1.
And maybe worth removng hhvm/php by hand over there after, so no one tries to use the host (since the motd war" [puppet] - 10https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [10:35:33] (03CR) 10ArielGlenn: [C: 031] "The DBAs will love you :-)" [puppet] - 10https://gerrit.wikimedia.org/r/445590 (owner: 10Muehlenhoff) [10:37:09] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [10:41:22] (03PS1) 10Arturo Borrero Gonzalez: install_server: add autoinstall recipe for cloudnet1003 [puppet] - 10https://gerrit.wikimedia.org/r/445591 (https://phabricator.wikimedia.org/T199521) [10:42:15] (03CR) 10Arturo Borrero Gonzalez: [C: 032] install_server: add autoinstall recipe for cloudnet1003 [puppet] - 10https://gerrit.wikimedia.org/r/445591 (https://phabricator.wikimedia.org/T199521) (owner: 10Arturo Borrero Gonzalez) [11:00:48] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain} [11:00:48] title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received [11:00:48] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:00:58] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a response w [11:00:58] PROBLEM - eventstreams on scb2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:01:28] PROBLEM - pdfrender on scb2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:01:38] PROBLEM - SSH on scb2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:01:38] PROBLEM - Check systemd state on scb2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:01:58] PROBLEM - nutcracker process on scb2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:02:48] RECOVERY - nutcracker process on scb2004 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [11:03:18] RECOVERY - pdfrender on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time [11:03:29] RECOVERY - SSH on scb2004 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [11:03:38] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [11:03:39] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [11:03:40] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [11:03:49] RECOVERY - eventstreams on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.100 second response time [11:03:58] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [11:10:18] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 28 failures. Last run 6 minutes ago with 28 failures. 
Failed resources (up to 3 shown): Service[prometheus-node-exporter],Service[diamond],Service[apertium-apy],Exec[chown /etc/cpjobqueue/config.yaml] [11:12:16] so I have seen instability on scb2* hosts lately [11:12:21] (03PS1) 10Muehlenhoff: Revert "Switch dbtree over to mwmaint1001" [puppet] - 10https://gerrit.wikimedia.org/r/445596 [11:12:25] does anyone know what that could be? [11:12:51] akosiaris, elukey, mobrovac ^ do you know of any maintenance or issues there? [11:13:21] this time scb2004, but before it was 2002 [11:13:23] nope, what's this [11:13:25] ... [11:13:41] on the other I saw strange kernel messages (memory pressure, maybe?) [11:13:48] I have not checked this one [11:13:53] (yet) [11:14:28] I am going to guess it is some stalling as a symptom [11:14:37] 10Operations, 10ops-eqiad: Relabel labnet1003.eqiad.wmnet as cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T199524 (10aborrero) [11:14:38] OOM [11:14:41] (overload on one of the services?) [11:14:55] yes, that would be consistent with what I saw on the other [11:14:55] euh? [11:14:59] which one oom-ed? [11:15:15] can some of you have a look, I am handling 2 other things, cannot look more, sorry [11:15:21] https://grafana.wikimedia.org/dashboard/db/host-overview?refresh=300s&panelId=4&fullscreen&orgId=1&var-server=scb2004&var-datasource=codfw%20prometheus%2Fops&var-cluster=scb [11:15:26] jynus: yeah I got this [11:16:42] (03PS2) 10Muehlenhoff: Revert "Switch dbtree over to mwmaint1001" [puppet] - 10https://gerrit.wikimedia.org/r/445596 [11:17:18] mobrovac: 2002 and 2004 today, 2003 yesterday, 2001 on Jul 10 and 2005 on Jul 7 [11:17:32] (03CR) 10Muehlenhoff: [C: 032] Revert "Switch dbtree over to mwmaint1001" [puppet] - 10https://gerrit.wikimedia.org/r/445596 (owner: 10Muehlenhoff) [11:17:36] killed a nodejs process of course [11:19:18] uh that must be mobileapps then, most likely [11:19:29] yeah looks like it [11:20:55] changeprop on scb2003 has a virt of 11.5 G ? 
[11:21:09] well everywhere not just scb2003 [11:21:15] and cpjobqueue another 11.5G virt [11:21:32] even if the RSS is way way lower... what on earth ? [11:23:06] anyway, it's not really mobileapps [11:23:12] it's eventstreams that has the most RSS [11:23:34] eventstreams or changeprop? [11:23:40] depending on the box anything from 600M to 2.5G [11:23:53] eventstreams is the greatest RSS user [11:24:28] that would make sense as both have catching up to do, given the outage [11:24:42] for VIRT memory it's cpjobqueue on most boxes but VIRT is generally irrelevant [11:24:59] for CP we can increase the concurrency a bit so that it goes through it quicker, but there's not a lot we can do for eventstreams [11:25:09] yeah [11:25:34] if we increase the cp concurrency we will be under even greater memory pressure though [11:26:18] 10Operations, 10Traffic: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (10Vgutierrez) [11:27:09] 10Operations, 10Traffic: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (10Vgutierrez) p:05Triage>03Normal [11:28:29] (03PS1) 10Jcrespo: dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) [11:28:58] mobileapps rss seems pretty stable over the last 7 days [11:29:11] (03CR) 10jerkins-bot: [V: 04-1] dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:29:29] mobrovac: why are the scb2* nodes showing this behavior? If there was catching up I'd have imagined scb1* ? [11:29:34] what am I missing? 
[11:30:10] but eventstreams talks to the eqiad main cluster only [11:30:20] the graph seems to support that idea - https://grafana.wikimedia.org/dashboard/db/eventstreams?refresh=1m&panelId=6&fullscreen&orgId=1 [11:30:22] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:30:39] (03PS2) 10Jcrespo: dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) [11:30:52] (03CR) 10Muehlenhoff: "Ack, we have the /root/decomission_appserver script for that purpose" [puppet] - 10https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [11:31:07] hm but if zoom out to 30 days there, it's a regular occurrence [11:31:10] * mobrovac sighs [11:31:30] (03CR) 10jerkins-bot: [V: 04-1] dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:32:44] hm and now we're back to normal [11:33:47] (03CR) 10Jcrespo: "This is not pretty, but it is easier than tryingh to refactor to do the profile/role in a single step. However, it is likely to break due " [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:33:53] so eventstreams in codfw uses main-eqiad as well? 
[11:34:23] yes, apparently [11:34:43] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=4&fullscreen&orgId=1&var-server=scb2004&var-datasource=codfw%20prometheus%2Fops&from=1530772635560&to=1531481613660 [11:34:43] got to run now [11:34:50] there is definitely a pattern there [11:34:59] something is leaking memory [11:35:38] this seems to have started around 2018-07-01 [11:35:59] hm [11:36:13] zooming out to 90 days shows that this has happened before [11:36:15] (03CR) 10Jcrespo: dbtree: move dbtree outside of mwmaint hosts (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:36:15] damn [11:36:24] k, got to run now for an errand, bbl [11:37:45] The pattern lately, however, is different from the one 90 days ago [11:38:44] (03PS3) 10Jcrespo: dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) [11:39:19] (03CR) 10jerkins-bot: [V: 04-1] dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:46:00] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:46:11] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1078 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445599 [11:47:32] (03PS1) 10Jcrespo: mariadb: Repool db1078 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445601 [11:55:48] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1078 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445601 (owner: 10Jcrespo) [11:57:25] (03Merged) 10jenkins-bot: mariadb: Repool db1078 with low load after maintenance [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/445601 (owner: 10Jcrespo) [11:58:02] (03CR) 10jenkins-bot: mariadb: Repool db1078 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445601 (owner: 10Jcrespo) [11:59:00] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 with low load (duration: 00m 50s) [11:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:25] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10MoritzMuehlenhoff) [12:15:27] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:20:03] (03PS1) 10Reedy: Add fluidsynth to wikimedia servers [puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) [12:20:05] (03PS1) 10Reedy: Remove timidity and freepats [puppet] - 10https://gerrit.wikimedia.org/r/445604 [12:20:41] (03CR) 10jerkins-bot: [V: 04-1] Add fluidsynth to wikimedia servers [puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) (owner: 10Reedy) [12:21:05] (03CR) 10jerkins-bot: [V: 04-1] Remove timidity and freepats [puppet] - 10https://gerrit.wikimedia.org/r/445604 (owner: 10Reedy) [12:21:25] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1078 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445599 [12:28:43] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: cleanup labnet1003 entries [dns] - 10https://gerrit.wikimedia.org/r/445606 (https://phabricator.wikimedia.org/T199521) [12:29:07] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: cleanup labnet1003 entries [dns] - 10https://gerrit.wikimedia.org/r/445606 (https://phabricator.wikimedia.org/T199521) (owner: 10Arturo Borrero Gonzalez) [12:31:58] (03CR) 10Muehlenhoff: "Timidty used to be used by the Score extension, is 
that no longer the case?" [puppet] - 10https://gerrit.wikimedia.org/r/445604 (owner: 10Reedy) [12:33:06] (03CR) 10Reedy: [C: 04-1] "It's hopefully going away soon - https://phabricator.wikimedia.org/T181897" [puppet] - 10https://gerrit.wikimedia.org/r/445604 (owner: 10Reedy) [12:33:21] (03PS2) 10Reedy: Add fluidsynth to wikimedia servers [puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) [12:33:23] (03PS2) 10Reedy: Remove timidity and freepats [puppet] - 10https://gerrit.wikimedia.org/r/445604 [12:34:19] (03CR) 10Muehlenhoff: "Definitely needs a better commit message, then :-)" [puppet] - 10https://gerrit.wikimedia.org/r/445604 (owner: 10Reedy) [12:34:53] (03CR) 10Reedy: "The commit summary describes perfectly what the commit does!" [puppet] - 10https://gerrit.wikimedia.org/r/445604 (owner: 10Reedy) [12:51:39] 10Operations, 10Traffic: Investigate NXDOMAIN DNS responses in our authdns servers - https://phabricator.wikimedia.org/T199525 (10Vgutierrez) from a 5 minutes traffic capture the following domains belong to the top 10 that are actually owned by the WMF but non configured in our DNS servers: ``` 4820 wikepedia.... [12:55:27] (03PS1) 10Vgutierrez: set up parking dns zones for the top 10 of current NXDOMAIN responses [dns] - 10https://gerrit.wikimedia.org/r/445611 (https://phabricator.wikimedia.org/T199525) [13:07:20] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me. 
To more paranoid approach would be to add the new caches definition in addition to cache_misc, wait for puppet to distri" [puppet] - 10https://gerrit.wikimedia.org/r/445126 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [13:10:10] (03PS1) 10Jcrespo: mariadb: Depool db1079 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445614 [13:13:37] (03PS4) 10Jcrespo: dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) [13:13:39] (03PS1) 10Jcrespo: mariadb: Return to not reimage any db servers by default [puppet] - 10https://gerrit.wikimedia.org/r/445615 [13:14:02] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:19] (03CR) 10jerkins-bot: [V: 04-1] dbtree: move dbtree outside of mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [13:17:12] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [13:40:49] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030 (10ayounsi) 05Open>03stalled Latest news, Fiberstore optics are not qualified for the MX204, only Finisar are. Waiting on T199483 to move forward here. [13:40:51] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [13:45:34] (03CR) 10Ebe123: [C: 04-1] "Let me block this until I'm ready." 
[puppet] - 10https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) (owner: 10Reedy) [13:45:52] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1078 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445599 (owner: 10Jcrespo) [13:47:02] 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff This host got reimaged with stretch (and renamed to mwmaint2001), so this is resolved. [13:47:10] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1078 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445599 (owner: 10Jcrespo) [13:47:27] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [13:48:14] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1078 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445599 (owner: 10Jcrespo) [13:48:23] (03CR) 10Gehel: Enable fetching constraints for Updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [13:50:55] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1079 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445614 (owner: 10Jcrespo) [13:51:15] (03PS1) 10Muehlenhoff: Update DNS config for wasat rename [dns] - 10https://gerrit.wikimedia.org/r/445617 (https://phabricator.wikimedia.org/T193915) [13:51:57] (03PS1) 10Elukey: profile::analytics::gitconfig: use system level configuration for git [puppet] - 10https://gerrit.wikimedia.org/r/445618 (https://phabricator.wikimedia.org/T198623) [13:52:13] (03Merged) 10jenkins-bot: mariadb: Depool db1079 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445614 (owner: 10Jcrespo) [13:52:29] (03CR) 10jenkins-bot: mariadb: Depool db1079 for reimage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/445614 (owner: 10Jcrespo) [13:52:52] (03CR) 10Elukey: [C: 032] profile::analytics::gitconfig: use system level configuration for git [puppet] - 10https://gerrit.wikimedia.org/r/445618 (https://phabricator.wikimedia.org/T198623) (owner: 10Elukey) [13:54:25] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 fully, depool db1079 (duration: 00m 50s) [13:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:20] (03PS1) 10Elukey: profile::analytics::cluster::gitconfig: fix previous pebkac [puppet] - 10https://gerrit.wikimedia.org/r/445619 (https://phabricator.wikimedia.org/T198623) [13:57:23] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:57:56] yeah this is me [13:58:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 756.83 seconds [14:00:45] (03CR) 10Elukey: [C: 032] profile::analytics::cluster::gitconfig: fix previous pebkac [puppet] - 10https://gerrit.wikimedia.org/r/445619 (https://phabricator.wikimedia.org/T198623) (owner: 10Elukey) [14:00:53] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:02:33] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:03:53] !log stop and reimage db1079 [14:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:03] PROBLEM - MariaDB Slave IO: s7 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1079.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1079.eqiad.wmnet (111 Connection refused) [14:09:27] that is expected [14:09:33] I will silence it [14:12:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 270.60 seconds [14:13:10] (03PS2) 10Jcrespo: mariadb: Return to not reimage any db servers by default [puppet] - 10https://gerrit.wikimedia.org/r/445615 [14:13:42] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1079 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445623 [14:16:03] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:21:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) So now on stat* and notebook* we have a /etc/gitconfig rule that forces all git users to use the http[s] proxy. The conf1006 fl... 
[14:25:26] (03PS9) 10Giuseppe Lavagetto: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) [14:26:21] (03CR) 10jerkins-bot: [V: 04-1] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [14:27:16] RECOVERY - MariaDB Slave IO: s7 on db1125 is OK: OK slave_io_state Slave_IO_Running: Yes [14:28:27] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) @MoritzMuehlenhoff interesting! I'll collect more info and post upstream with those too. Diagnostics on two machines haven't yielded anything wrt... [14:28:40] (03PS1) 10Jcrespo: mariadb: Repool db1079 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445625 [14:31:30] (03CR) 10Jcrespo: [C: 032] mariadb: Return to not reimage any db servers by default [puppet] - 10https://gerrit.wikimedia.org/r/445615 (owner: 10Jcrespo) [14:32:10] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1079 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445625 (owner: 10Jcrespo) [14:33:30] (03Merged) 10jenkins-bot: mariadb: Repool db1079 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445625 (owner: 10Jcrespo) [14:33:45] (03CR) 10jenkins-bot: mariadb: Repool db1079 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445625 (owner: 10Jcrespo) [14:34:53] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 with low load (duration: 00m 50s) [14:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:55] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1079 for reimage" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/445623 [14:56:23] (03PS3) 10Andrew Bogott: mwopenstackclients.py: allow specifying 'region' to the nova client [puppet] - 10https://gerrit.wikimedia.org/r/445463 [14:56:25] (03PS1) 10Andrew Bogott: Horizon: support multiple regions [puppet] - 10https://gerrit.wikimedia.org/r/445626 [14:57:11] (03CR) 10Andrew Bogott: [C: 032] mwopenstackclients.py: allow specifying 'region' to the nova client [puppet] - 10https://gerrit.wikimedia.org/r/445463 (owner: 10Andrew Bogott) [14:58:52] 10Operations, 10Goal: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10MoritzMuehlenhoff) p:05Triage>03High [14:58:52] (03PS2) 10Andrew Bogott: Horizon: support multiple regions [puppet] - 10https://gerrit.wikimedia.org/r/445626 [15:02:08] (03CR) 10Andrew Bogott: [C: 032] Horizon: support multiple regions [puppet] - 10https://gerrit.wikimedia.org/r/445626 (owner: 10Andrew Bogott) [15:05:21] (03PS1) 10Andrew Bogott: Horizon: add some forgotten quote-marks [puppet] - 10https://gerrit.wikimedia.org/r/445627 [15:06:00] (03CR) 10Andrew Bogott: [C: 032] Horizon: add some forgotten quote-marks [puppet] - 10https://gerrit.wikimedia.org/r/445627 (owner: 10Andrew Bogott) [15:18:57] PROBLEM - Check systemd state on labtestcontrol2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:24:31] (03PS1) 10Andrew Bogott: Horizon: mark out the REGION section for now [puppet] - 10https://gerrit.wikimedia.org/r/445631 [15:26:57] (03CR) 10Andrew Bogott: [C: 032] Horizon: mark out the REGION section for now [puppet] - 10https://gerrit.wikimedia.org/r/445631 (owner: 10Andrew Bogott) [15:35:17] RECOVERY - Check systemd state on labtestcontrol2003 is OK: OK - running: The system is fully operational [16:26:56] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 35, down: 2, dormant: 0, excluded: 0, unused: 0 [16:27:06] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [16:27:46] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0 [16:28:46] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 [16:33:27] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [16:33:37] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [16:42:33] what was that? :) [16:44:13] (03CR) 10Eevans: [C: 031] Enable microcode on restbase servers [puppet] - 10https://gerrit.wikimedia.org/r/445573 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [17:25:57] (03CR) 10Smalyshev: Enable fetching constraints for Updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:32:10] (03CR) 10Anomie: "I've confirmed that all wikis on wmf.12 currently have the variable set to the same thing this patch will set them to. 
But, since it's Fri" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [17:37:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) In addition to T198623#4415961 We have notebook1003 and notebook1004 sending `ICMPv6 Multicast Listener Report` every 2 minute... [17:54:08] (03PS3) 10Daniel Kinzler: wgMultiContentRevisionSchemaMigrationStage SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [17:54:17] (03CR) 10jerkins-bot: [V: 04-1] wgMultiContentRevisionSchemaMigrationStage SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [17:55:27] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) https://www.ietf.org/proceedings/50/I-D/nfsv4-rpc-ipv6-00.txt ``` IPv6 enabled RPC service must join a well known multicast... 
[18:10:06] (03PS4) 10Daniel Kinzler: wgMultiContentRevisionSchemaMigrationStage SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [18:32:18] (03PS1) 10Nuria: Adding acomputed measure of ratio of bot requests on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 [18:33:18] !log droping undocumented grants on m5 T199518 [18:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:27] T199518: Undocumented grants on striker from californium - https://phabricator.wikimedia.org/T199518 [18:35:26] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1079 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445623 (owner: 10Jcrespo) [18:37:19] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1079 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445623 (owner: 10Jcrespo) [18:39:50] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1079 fully (duration: 00m 50s) [18:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:53] !log reindexing Mirandese wikis on elastic@codfw (T197890) [18:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:57] T197890: Re-index Mirandese Wikis - https://phabricator.wikimedia.org/T197890 [18:45:06] (03CR) 10Dereckson: Test spaces in ExtraNamespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: 10Framawiki) [18:45:28] (03CR) 10ArielGlenn: [C: 032] "Yeah my wording was bad; it was 'I can merge this whenever you like'. Anyhow, I shall merge it now. 
:-D" [dumps] - 10https://gerrit.wikimedia.org/r/404987 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight) [18:48:36] !log reindexing Mirandese wikis on elastic@eqiad (T197890) [18:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:40] T197890: Re-index Mirandese Wikis - https://phabricator.wikimedia.org/T197890 [19:15:56] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1079 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445623 (owner: 10Jcrespo) [19:52:56] (03PS2) 10Nuria: Adding acomputed measure of ratio of bot requests on pageview datasets [puppet] - 10https://gerrit.wikimedia.org/r/445654 [19:58:21] (03PS9) 10Framawiki: Test spaces in ExtraNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) [19:58:49] (03CR) 10Framawiki: Test spaces in ExtraNamespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: 10Framawiki) [20:00:02] (03CR) 10Framawiki: Test spaces in ExtraNamespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: 10Framawiki) [20:03:41] (03CR) 10Krinkle: "@Imarlier: Hm.., rename it how? I haven't changed the name of the role in this commit." 
[puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [20:09:12] (03CR) 10Alex Monk: [C: 032] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [20:11:18] (03Merged) 10jenkins-bot: get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [20:11:44] (03PS3) 10Krinkle: grafana: Remove varnish-http-errors dashboard [puppet] - 10https://gerrit.wikimedia.org/r/445336 [20:12:03] (03CR) 10jenkins-bot: get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [20:21:13] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1051 - https://phabricator.wikimedia.org/T195484 (10RobH) [20:24:38] (03PS1) 10RobH: d1051 decom from repo [puppet] - 10https://gerrit.wikimedia.org/r/445665 (https://phabricator.wikimedia.org/T195484) [20:25:14] (03PS1) 10RobH: decom db1051 prod dns [dns] - 10https://gerrit.wikimedia.org/r/445666 (https://phabricator.wikimedia.org/T195484) [20:25:34] (03CR) 10RobH: [C: 032] d1051 decom from repo [puppet] - 10https://gerrit.wikimedia.org/r/445665 (https://phabricator.wikimedia.org/T195484) (owner: 10RobH) [20:26:05] (03CR) 10RobH: [C: 032] decom db1051 prod dns [dns] - 10https://gerrit.wikimedia.org/r/445666 (https://phabricator.wikimedia.org/T195484) (owner: 10RobH) [20:27:10] 10Operations, 10Cloud-Services: Missing Labs hiera entry in labs-private repo - https://phabricator.wikimedia.org/T152767 (10Krenair) 05Open>03Resolved a:03Volans Moved that to T199575 [20:28:52] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1054 - https://phabricator.wikimedia.org/T197063 (10RobH) a:05Cmjohnson>03None [20:30:16] (03CR) 10ArielGlenn: [C: 04-1] "The standalone script works well, these I have not yet tested but there are a couple small issues in line." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) (owner: Smalyshev)
[20:30:53] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259 (Krenair)
[20:30:55] Operations, Beta-Cluster-Infrastructure, HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006 (Krenair)
[20:31:09] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371 (Krenair) Open>Resolved a:fgiunchedi
[20:32:01] (PS1) Smalyshev: Remove labs wikis from the list, don't need them [mediawiki-config] - https://gerrit.wikimedia.org/r/445720
[20:32:23] (CR) C. Scott Ananian: [C: 1] "It's been suggested that the Traffic team needs to review/push this. Brandon, is that you?" [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[20:32:26] Puppet, Beta-Cluster-reproducible, Patch-For-Review: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946 (Krenair) Open>Resolved I haven't.
[20:35:48] (CR) ArielGlenn: [C: 1] "thanks!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/445720 (owner: Smalyshev)
[20:36:15] (PS12) Smalyshev: Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356)
[20:36:17] (CR) Smalyshev: Generate daily diffs for categories RDF (2 comments) [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) (owner: Smalyshev)
[20:36:42] Operations, Cloud-Services: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402 (Krenair) See also {T196252}
[20:37:47] Operations, ops-eqiad, DBA, decommission, Patch-For-Review: Decommission db1054 - https://phabricator.wikimedia.org/T197063 (RobH)
[20:38:37] (PS1) RobH: decom db1054 [dns] - https://gerrit.wikimedia.org/r/445722 (https://phabricator.wikimedia.org/T197063)
[20:39:38] (PS1) RobH: decom db1054 from repo [puppet] - https://gerrit.wikimedia.org/r/445723 (https://phabricator.wikimedia.org/T197063)
[20:39:43] (CR) RobH: [C: 2] decom db1054 [dns] - https://gerrit.wikimedia.org/r/445722 (https://phabricator.wikimedia.org/T197063) (owner: RobH)
[20:40:15] (CR) RobH: [C: 2] decom db1054 from repo [puppet] - https://gerrit.wikimedia.org/r/445723 (https://phabricator.wikimedia.org/T197063) (owner: RobH)
[20:40:22] (PS2) Alex Monk: Remove labs wikis from the categories-rdf list, don't need them [mediawiki-config] - https://gerrit.wikimedia.org/r/445720 (owner: Smalyshev)
[20:41:23] Operations, ops-eqiad, DBA, decommission: Decommission db1054 - https://phabricator.wikimedia.org/T197063 (RobH)
[20:41:37] Operations, ops-eqiad, DBA, decommission: Decommission db1054 - https://phabricator.wikimedia.org/T197063 (RobH) a:Cmjohnson
[20:47:59] Puppet, Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941 (Krenair) Right now if
you have a project puppetmaster and want to add a new instance to your project, to do it safely, you need to create your instance first, get it hooked up to...
[20:53:18] Operations, Cloud-Services: Moving network::external to hiera broke much of labs - https://phabricator.wikimedia.org/T141959 (Krenair) Was this ever done? Was it some puppet version thing perhaps?
[21:16:22] (CR) Ebe123: [C: -1] "Was gone; couldn't make these comments until now." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/445603 (https://phabricator.wikimedia.org/T184598) (owner: Reedy)
[21:29:50] Puppet, Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941 (bd808) >>! In T152941#4423902, @Krenair wrote: > Maybe have it check for puppet.{project}.wmflabs.org so projects can CNAME that to their chosen puppetmaster. At least, assuming t...
[21:39:04] Puppet, Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941 (Krenair) >>! In T152941#4423954, @bd808 wrote: >>>! In T152941#4423902, @Krenair wrote: >> Maybe have it check for puppet.{project}.wmflabs.org so projects can CNAME that to their...
[21:59:41] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:59:51] PROBLEM - nutcracker process on scb2005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:00:01] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /en.wikipedia.org/v1/page/media/{title}{/revision} (Get media in test page) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received
[22:00:11] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: / (spec from root) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve en-wiktionary definitions for cat) timed out before a response was received
[22:00:13] PROBLEM - SSH on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:00:21] PROBLEM - pdfrender on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:00:21] PROBLEM - eventstreams on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:00:21] PROBLEM - apertium apy on scb2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:00:51] RECOVERY - nutcracker process on scb2005 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker
[22:01:02] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[22:01:12] RECOVERY - SSH on scb2005 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0)
[22:01:12] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[22:01:21] RECOVERY - pdfrender on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.074 second response time
[22:01:21] RECOVERY - apertium apy on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.077 second response time
[22:01:21] RECOVERY - eventstreams on scb2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.105 second
response time
[22:01:42] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy
[22:51:51] PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 57 failures. Last run 5 minutes ago with 57 failures. Failed resources (up to 3 shown): Package[tzdata],Package[apport],Package[command-not-found],Package[command-not-found-data]
[23:17:12] RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[23:37:11] PROBLEM - eventstreams on scb2001 is CRITICAL: connect to address 10.192.32.132 and port 8092: Connection refused
[23:41:14] Operations, ops-eqiad: Relabel labnet1003.eqiad.wmnet as cloudnet1003.eqiad.wmnet - https://phabricator.wikimedia.org/T199524 (Peachey88)
[23:46:52] RECOVERY - eventstreams on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.086 second response time