[00:17:07] 10Operations, 10LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#3468636 (10bbogaert) p:05Triage>03Normal
[00:40:08] 10Operations, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on labsdb1001 - https://phabricator.wikimedia.org/T171538#3468270 (10faidon) >>! In T171538#3468451, @chasemp wrote: > hmm > > ```# cat /proc/mdstat > Personalities : > unused devices: ```...
[00:59:29] 10Operations, 10Traffic: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498#3468712 (10faidon) I think this is a good idea overall and that we should be doing that. A few points: - I'm worried a little bit that this will hide issues like the ones you mentioned under the c...
[01:25:05] (03CR) 10Chad: [C: 031] "Actually, I think this is safe enough to just go ahead then. LDAP_BIND could be messy, and like we said earlier it /probably/ doesn't /req" [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox)
[01:59:06] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3468785 (10Jayprakash12345) Will Quiz Extension be install automatically at the time of wiki creation? Or another task open for this purpose
[02:02:35] (03CR) 10Krinkle: Phabricator: Redirect all http traffic to https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) (owner: 10Paladox)
[02:11:52] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3468806 (10StevenJ81) p:05High>03Normal Not your place to change the priority. System developers and LangCom members decide that.
[03:00:26] !log l10nupdate@tin LocalisationUpdate failed: git pull of extensions failed
[03:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:32] (03PS1) 10Reedy: Don't need to update submodules recursively [puppet] - 10https://gerrit.wikimedia.org/r/367639
[03:36:05] (03CR) 10Reedy: "--recursive made it unhappy" [puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy)
[04:10:19] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=516.30 Read Requests/Sec=695.10 Write Requests/Sec=1.00 KBytes Read/Sec=45331.60 KBytes_Written/Sec=17.60
[04:18:29] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=7.60 Read Requests/Sec=163.80 Write Requests/Sec=2.10 KBytes Read/Sec=672.00 KBytes_Written/Sec=54.40
[04:18:49] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0
[04:19:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0
[04:21:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[04:21:59] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0
[04:51:27] 10Operations, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on labsdb1001 - https://phabricator.wikimedia.org/T171538#3468270 (10Marostegui) The disk failed is part of a HW RAID10: ``` # megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virt...
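The T171538 comments quoted above hinge on `/proc/mdstat` being empty (only `Personalities :` and `unused devices:`), which means the kernel md driver is managing no software-RAID arrays, so the degraded array has to live on a hardware controller and is inspected with `megacli -LDPDInfo -aAll` instead. A minimal illustrative sketch (not WMF tooling; the parsing heuristic is an assumption) of telling the two cases apart from mdstat text:

```python
# Sketch: list the md arrays mentioned in /proc/mdstat text. An empty
# result (as in the paste quoted above) means no software RAID is
# configured, so a degraded-RAID alert points at a hardware controller
# and needs a vendor tool such as megacli rather than mdadm.

def md_arrays(mdstat_text: str) -> list:
    """Return names of md devices listed in /proc/mdstat-style text."""
    arrays = []
    for line in mdstat_text.splitlines():
        # Array lines look like: "md0 : active raid1 sda1[0] sdb1[1]"
        if line.startswith("md") and " : " in line:
            arrays.append(line.split()[0])
    return arrays

# The labsdb1001 paste above (no arrays) vs. a typical software-RAID host:
empty = "Personalities :\nunused devices: <none>\n"
sample = ("Personalities : [raid1]\n"
          "md0 : active raid1 sda1[0] sdb1[1]\n"
          "unused devices: <none>\n")
```

On a host like labsdb1001, `md_arrays` would return an empty list, matching the reasoning in the ticket.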
[05:07:37] (03PS3) 10Foxy brown: Enable Article Reminder feature flag on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367318 (https://phabricator.wikimedia.org/T169354)
[05:22:19] (03PS4) 10Foxy brown: Enable Article Reminder feature flag on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367318 (https://phabricator.wikimedia.org/T169354)
[05:23:28] (03CR) 10Mattflaschen: [C: 032] Enable Article Reminder feature flag on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367318 (https://phabricator.wikimedia.org/T169354) (owner: 10Foxy brown)
[05:25:21] (03Merged) 10jenkins-bot: Enable Article Reminder feature flag on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367318 (https://phabricator.wikimedia.org/T169354) (owner: 10Foxy brown)
[05:26:04] (03CR) 10jenkins-bot: Enable Article Reminder feature flag on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367318 (https://phabricator.wikimedia.org/T169354) (owner: 10Foxy brown)
[05:29:01] !log mattflaschen@tin Synchronized wmf-config/CommonSettings-labs.php: Article reminder: Beta Cluster only (duration: 00m 44s)
[05:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:13] (03CR) 10Krinkle: [C: 031] contint: webperformance Jenkins slave [puppet] - 10https://gerrit.wikimedia.org/r/367411 (https://phabricator.wikimedia.org/T166756) (owner: 10Hashar)
[05:46:39] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[06:37:49] RECOVERY - pdfrender on scb2004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.075 second response time
[06:44:59] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[07:07:09] (03PS4) 10Giuseppe Lavagetto: rsyslog::conf: validate priority with validate_numeric [puppet] - 10https://gerrit.wikimedia.org/r/365570
[07:09:16] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7145/ the compiler shows this is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/365570 (owner: 10Giuseppe Lavagetto)
[07:20:26] (03PS3) 10Giuseppe Lavagetto: sysctl::conffile: validate priority as numeric [puppet] - 10https://gerrit.wikimedia.org/r/365571
[07:23:38] (03CR) 10Giuseppe Lavagetto: [C: 032] "All references in the tree are strictly numeric, should be another noop." [puppet] - 10https://gerrit.wikimedia.org/r/365571 (owner: 10Giuseppe Lavagetto)
[07:27:32] !log installing apache security updates on app servers in eqiad
[07:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:40] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:41:09] (03PS1) 10Muehlenhoff: Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645
[07:41:54] (03CR) 10jerkins-bot: [V: 04-1] Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645 (owner: 10Muehlenhoff)
[07:42:46] (03PS2) 10Muehlenhoff: Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645
[07:43:57] (03CR) 10jerkins-bot: [V: 04-1] Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645 (owner: 10Muehlenhoff)
[07:48:16] (03PS3) 10Muehlenhoff: Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645
[07:49:13] (03CR) 10jerkins-bot: [V: 04-1] Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645 (owner: 10Muehlenhoff)
[07:49:25] (03PS7) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572
[07:49:27] (03PS1) 10Giuseppe Lavagetto: etcd: convert to using systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/367646
[07:49:29] (03PS1) 10Giuseppe Lavagetto: etcdmirror: convert to using systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/367647
[07:50:17] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3469042 (10Marostegui) click_tracking has been backuped: ``` root@dbstore1001:/srv/tmp...
[07:50:44] (03CR) 10jerkins-bot: [V: 04-1] etcdmirror: convert to using systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/367647 (owner: 10Giuseppe Lavagetto)
[07:55:50] (03PS2) 10Giuseppe Lavagetto: etcdmirror: convert to using systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/367647
[07:55:53] (03PS8) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572
[07:58:29] (03PS4) 10Muehlenhoff: Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645
[07:59:28] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7146/" [puppet] - 10https://gerrit.wikimedia.org/r/367646 (owner: 10Giuseppe Lavagetto)
[07:59:38] (03PS2) 10Giuseppe Lavagetto: etcd: convert to using systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/367646
[07:59:43] (03CR) 10jerkins-bot: [V: 04-1] Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645 (owner: 10Muehlenhoff)
[08:04:06] (03PS2) 10Elukey: role::prometheus::memcached_exporter: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/367375
[08:05:25] (03CR) 10Elukey: [C: 032] role::prometheus::memcached_exporter: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/367375 (owner: 10Elukey)
[08:06:10] (03PS3) 10Giuseppe Lavagetto: etcdmirror: convert to using systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/367647
[08:13:33] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#3469071 (10Marostegui) click_tracking_user_properties was empty in lots of places, but...
[08:14:03] (03PS4) 10Giuseppe Lavagetto: etcdmirror: convert to using systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/367647
[08:16:28] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/7148" [puppet] - 10https://gerrit.wikimedia.org/r/367647 (owner: 10Giuseppe Lavagetto)
[08:19:52] (03PS3) 10Gehel: wdqs - send ldf traffic to wdqs1003.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/363596 (https://phabricator.wikimedia.org/T166244)
[08:20:40] PROBLEM - puppet last run on conf2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:21:41] (03CR) 10Gehel: [C: 032] wdqs - send ldf traffic to wdqs1003.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/363596 (https://phabricator.wikimedia.org/T166244) (owner: 10Gehel)
[08:23:26] (03PS1) 10Giuseppe Lavagetto: systemd::syslog: depend on the service, not base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/367649
[08:28:41] (03PS2) 10Gehel: Fix nginx parametrization - use variable consistently for port [puppet] - 10https://gerrit.wikimedia.org/r/364349 (owner: 10Smalyshev)
[08:28:47] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171492#3469126 (10fgiunchedi)
[08:28:49] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3469128 (10fgiunchedi)
[08:29:56] (03PS1) 10Marostegui: check_private_data.py: Add socket parameter [puppet] - 10https://gerrit.wikimedia.org/r/367650 (https://phabricator.wikimedia.org/T153743)
[08:30:08] 10Operations, 10LDAP-Access-Requests, 10Wikidata-Sprint: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3469133 (10christophneuroth) @MoritzMuehlenhoff additional NDA has been signed 📝 ✅ 🎉
[08:30:09] (03CR) 10Gehel: [C: 032] Fix nginx parametrization - use variable consistently for port [puppet] - 10https://gerrit.wikimedia.org/r/364349 (owner: 10Smalyshev)
[08:30:17] (03CR) 10Giuseppe Lavagetto: [C: 032] systemd::syslog: depend on the service, not base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/367649 (owner: 10Giuseppe Lavagetto)
[08:30:26] (03PS2) 10Giuseppe Lavagetto: systemd::syslog: depend on the service, not base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/367649
[08:37:50] RECOVERY - puppet last run on conf2002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:38:49] (03Abandoned) 10Giuseppe Lavagetto: role::configcluster: switch to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365559 (owner: 10Giuseppe Lavagetto)
[08:41:34] (03CR) 10Ema: varnish: Avoid std.fileread() and use new errorpage template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[08:48:51] 10Operations, 10Commons, 10Traffic, 10media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3469152 (10fgiunchedi) @Aklapper _usually_ traffic since this indicates varnish failure to fetch and most likely a network or varnish problem. See als...
[08:54:02] (03PS2) 10Marostegui: check_private_data.py: Add socket parameter [puppet] - 10https://gerrit.wikimedia.org/r/367650 (https://phabricator.wikimedia.org/T153743)
[08:57:57] (03PS5) 10Muehlenhoff: Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645
[08:58:53] (03CR) 10jerkins-bot: [V: 04-1] Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645 (owner: 10Muehlenhoff)
[09:01:08] (03PS6) 10Muehlenhoff: Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645
[09:14:19] !log upgrade diamond to 4.0.515 in ulsfo and esams - T97635
[09:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:31] T97635: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635
[09:15:12] !log upgrade restbase-test* and restbase-dev* to latest OpenJDK security update
[09:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:58] 10Operations, 10ops-ulsfo, 10hardware-requests, 10Patch-For-Review: Decommission cp400[1-4] - https://phabricator.wikimedia.org/T169020#3384648 (10fgiunchedi) cp400[234] were not 'puppet node clean' nor 'puppet node deactivate' btw, I've done that now
[09:18:48] (03PS2) 10Ema: varnish: reject phabricator uploads from WP0 users [puppet] - 10https://gerrit.wikimedia.org/r/367422 (https://phabricator.wikimedia.org/T168142)
[09:19:51] (03CR) 10Ema: [C: 032] varnish: reject phabricator uploads from WP0 users [puppet] - 10https://gerrit.wikimedia.org/r/367422 (https://phabricator.wikimedia.org/T168142) (owner: 10Ema)
[09:20:57] !log upgrade diamond to 4.0.515 in codfw - T97635
[09:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:07] T97635: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635
[09:23:29] PROBLEM - puppet last run on mw2243 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[diamond]
[09:42:37] (03CR) 10MarcoAurelio: Allow contentadmin/sysop to configure blocking AbuseFilters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369 (owner: 10MarcoAurelio)
[09:45:16] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apache2]
[09:46:33] (03PS3) 10MarcoAurelio: Allow contentadmin/sysop to configure blocking AbuseFilters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369
[09:49:01] (03CR) 10MarcoAurelio: Allow contentadmin/sysop to configure blocking AbuseFilters (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369 (owner: 10MarcoAurelio)
[09:50:12] 10Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3469311 (10zhuyifei1999)
[09:53:05] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:01:34] (03PS9) 10Giuseppe Lavagetto: role::configcluster: move to future environment [puppet] - 10https://gerrit.wikimedia.org/r/365572
[10:01:36] (03PS1) 10Giuseppe Lavagetto: apt::repository: fix for future parser [puppet] - 10https://gerrit.wikimedia.org/r/367658
[10:01:38] (03PS1) 10Giuseppe Lavagetto: prometheus::node::exporter: ugly workaround for future parser [puppet] - 10https://gerrit.wikimedia.org/r/367659
[10:03:22] 10Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3469363 (10zhuyifei1999)
[10:05:00] (03CR) 10MarcoAurelio: "@Amire80 What's the status on this one? Can you detail what it is exactly changing so we can inform the requestors? Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363370 (https://phabricator.wikimedia.org/T168727) (owner: 10MarcoAurelio)
[10:12:45] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[10:13:25] (03PS3) 10Marostegui: Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[10:16:34] 10Operations, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3469431 (10MoritzMuehlenhoff)
[10:19:08] (03PS3) 10MarcoAurelio: High density logos for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366875 (https://phabricator.wikimedia.org/T170604)
[10:19:17] (03PS2) 10MarcoAurelio: High density logos for es.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367008 (https://phabricator.wikimedia.org/T170604)
[10:20:09] (03CR) 10Jcrespo: "The code is ok, but let's move it to main() or to a parse_options function." [puppet] - 10https://gerrit.wikimedia.org/r/367650 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui)
[10:20:59] (03PS6) 10Jcrespo: Parsercache: Purge rows every day, and reduce TTL to 22 days [puppet] - 10https://gerrit.wikimedia.org/r/361656 (https://phabricator.wikimedia.org/T167784)
[10:21:06] (03CR) 10Jcrespo: [C: 031] Parsercache: Purge rows every day, and reduce TTL to 22 days [puppet] - 10https://gerrit.wikimedia.org/r/361656 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[10:21:54] (03CR) 10Jcrespo: [C: 031] "We should deploy this one first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[10:23:46] (03CR) 10Marostegui: [C: 032] Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[10:25:38] (03Merged) 10jenkins-bot: Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[10:25:47] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#3469475 (10jcrespo)
[10:26:06] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#2582458 (10jcrespo)
[10:26:09] (03CR) 10jenkins-bot: Parsercache: Reduce expiration time to 22 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361659 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[10:27:02] !log upgrade restbase2010 to latest OpenJDK security update
[10:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:06] !log marostegui@tin Synchronized wmf-config/InitialiseSettings.php: Parsercache: Reduce expiration time to 22 days - T167784 (duration: 00m 44s)
[10:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:15] T167784: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784
[10:29:32] (03CR) 10Marostegui: "> The code is ok, but let's move it to main() or to a parse_options" [puppet] - 10https://gerrit.wikimedia.org/r/367650 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui)
[10:33:17] (03CR) 10Jcrespo: [C: 031] check_private_data.py: Add socket parameter [puppet] - 10https://gerrit.wikimedia.org/r/367650 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui)
[10:33:44] jynus: https://gerrit.wikimedia.org/r/#/c/361656/ - I will merge this, unless you want to do it yourself :)
[10:33:59] do it
[10:34:13] doing it
[10:34:21] (03CR) 10Marostegui: [C: 032] Parsercache: Purge rows every day, and reduce TTL to 22 days [puppet] - 10https://gerrit.wikimedia.org/r/361656 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[10:34:43] I would run it now, actually after deploying
[10:35:20] deployed
[10:36:10] (03PS1) 10Jcrespo: prometheus-mysqld-exporter: Classify dbstores, now they can shard [puppet] - 10https://gerrit.wikimedia.org/r/367660 (https://phabricator.wikimedia.org/T170666)
[10:36:22] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Classify dbstores, now they can shard [puppet] - 10https://gerrit.wikimedia.org/r/367660 (https://phabricator.wikimedia.org/T170666)
[10:39:41] (03PS1) 10Gehel: maps - ensure cleanup also deletes related objects [puppet] - 10https://gerrit.wikimedia.org/r/367661 (https://phabricator.wikimedia.org/T169011)
[10:41:07] (03CR) 10Gehel: [C: 032] maps - ensure cleanup also deletes related objects [puppet] - 10https://gerrit.wikimedia.org/r/367661 (https://phabricator.wikimedia.org/T169011) (owner: 10Gehel)
[10:41:45] !log Run mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 500 from terbium - T167784
[10:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:56] T167784: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784
[10:42:19] are you running it?
[10:42:31] yep
[10:43:03] from a screen called: purge_old_rows
[10:44:05] I see the deletes coming in
[10:44:28] 2K of them per server
[10:44:50] (03CR) 10Jcrespo: [C: 032] prometheus-mysqld-exporter: Classify dbstores, now they can shard [puppet] - 10https://gerrit.wikimedia.org/r/367660 (https://phabricator.wikimedia.org/T170666) (owner: 10Jcrespo)
[10:44:56] (03PS3) 10Jcrespo: prometheus-mysqld-exporter: Classify dbstores, now they can shard [puppet] - 10https://gerrit.wikimedia.org/r/367660 (https://phabricator.wikimedia.org/T170666)
[10:45:02] yeah, I am running it with the default 500ms sleep, will be slow but safer for lag
[10:46:13] 0.12%
[10:46:26] yeah :(
[10:46:51] some weeks ago, when the issues first arose, tim ran it with 200ms and it generated some lag
[10:50:22] to be fair, in this case I do not care much about lag
[10:50:40] but I do not want to saturate the host's IOPS
[10:50:46] !log installing openjdk security updates on restbase*
[10:50:47] especially because it can wait
[10:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:25] (03PS1) 10Ema: pybal::monitoring: add check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/367662 (https://phabricator.wikimedia.org/T134893)
[10:54:52] (03PS2) 10Ema: pybal::monitoring: add check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/367662 (https://phabricator.wikimedia.org/T134893)
[10:56:18] (03PS1) 10Marostegui: Parsercache: Purge only certain days [puppet] - 10https://gerrit.wikimedia.org/r/367665 (https://phabricator.wikimedia.org/T167784)
[10:56:20] jynus: ^
[10:57:14] (03CR) 10jerkins-bot: [V: 04-1] Parsercache: Purge only certain days [puppet] - 10https://gerrit.wikimedia.org/r/367665 (https://phabricator.wikimedia.org/T167784) (owner: 10Marostegui)
[10:58:22] (03PS2) 10Marostegui: Parsercache: Purge only certain days [puppet] - 10https://gerrit.wikimedia.org/r/367665 (https://phabricator.wikimedia.org/T167784)
[11:01:08] (03CR) 10Jcrespo: [C: 04-1] Parsercache: Purge only certain days [puppet] - 10https://gerrit.wikimedia.org/r/367665 (https://phabricator.wikimedia.org/T167784) (owner: 10Marostegui)
[11:01:23] precisely by purging more often, we will have less problems
[11:01:43] this will take a lot of time because it is the first time
[11:02:07] ok, let's wait then and run two iterations once the first one is done
[11:02:11] the following ones should be faster
[11:02:32] if you increase the time between purges
[11:02:33] !log installing openjdk security updates on elastic*
[11:02:38] it will be worse, not better
[11:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:13] also, most likely, we may be able to run several instances of this
[11:03:22] or even help it manually on a separate thread
[11:03:35] let's wait then for it to finish and check a second iteration once it is done
[11:03:41] yes
[11:03:47] But we will certainly need to kill it before midnight
[11:03:52] I will do it before going to bed
[11:03:54] and !log it
[11:04:01] no need
[11:04:07] we can set it up now
[11:04:35] Jeez…0.54% XD
[11:04:56] at command
[11:05:17] or timeout :)
[11:05:26] * elukey lunch
[11:06:14] or leave it as is
[11:06:23] there is apparently 2 instances running right now
[11:07:51] one by cron
[11:10:26] by cron?
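The purge run being discussed above (`purgeParserCache.php --age=1900800 --msleep 500`) deletes expired rows in batches and sleeps between batches so replicas keep up; the 500 ms default is slower but safer for lag than the 200 ms run mentioned at 10:46. A hypothetical Python sketch of that batch-and-sleep throttling pattern (`delete_batch` is a stand-in, not the real MediaWiki code); note that 22 days is exactly the 1 900 800 seconds passed as `--age`:

```python
import time

# Hypothetical sketch of the batch-and-sleep throttling used by
# purgeParserCache.php: delete expired rows in bounded batches and
# sleep between batches so replication lag and IOPS stay bounded.
AGE_22_DAYS = 22 * 24 * 3600  # = 1900800 s, matching --age=1900800 above

def purge(delete_batch, msleep=500):
    """Call delete_batch() until it reports 0 deleted rows.

    delete_batch is any callable returning the number of rows it
    removed; msleep mirrors the script's millisecond sleep option.
    Returns the total number of rows purged.
    """
    total = 0
    while True:
        deleted = delete_batch()
        total += deleted
        if deleted == 0:
            return total
        time.sleep(msleep / 1000.0)  # --msleep is in milliseconds
```

A longer sleep lowers replica lag at the cost of a longer total run, which is why the first full pass above crawls along at fractions of a percent per minute.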
[11:10:56] PID 5175 and children
[11:11:22] with old parameters
[11:11:30] yea
[11:11:32] from 23 jul
[11:11:56] let's kill all of them, and start one with timeout 10h (so it will be killed after 10 hours, which is almost midnight)
[11:12:18] ok
[11:12:29] i will do it
[11:12:29] log it in case it generates logs
[11:12:33] or root mails
[11:13:01] yeah
[11:13:24] !log Killing old running instances of purgeParserCache.php in terbium - https://phabricator.wikimedia.org/T167784
[11:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:58] wait
[11:14:08] the patch is wrong
[11:14:18] 0 1 * * 0
[11:14:21] the old corn patch?
[11:14:23] cron
[11:15:14] That is 1:00 on every sunday
[11:16:15] !log upgrading/restarting logstash* for openjdk security update
[11:16:19] ah, interesting so: https://gerrit.wikimedia.org/r/#/c/361656/6/modules/mediawiki/manifests/maintenance/parsercachepurging.pp assumes that if no weekday specified it is 0?
[11:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:35] which is so wrong
[11:16:41] indeed
[11:17:40] (03PS1) 10Jcrespo: parserCachePurge: Run it every day, not only on Sunday [puppet] - 10https://gerrit.wikimedia.org/r/367666 (https://phabricator.wikimedia.org/T167784)
[11:17:56] ^
[11:18:03] (03CR) 10Marostegui: [C: 031] parserCachePurge: Run it every day, not only on Sunday [puppet] - 10https://gerrit.wikimedia.org/r/367666 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[11:18:44] "Optional; if specified, must be between 0 and 7"
[11:19:29] I think it is ok, it says the same from the other parameters
[11:19:44] hi marostegui
[11:19:59] !log Start a run of "timeout 10h purgeParserCache.php" on terbium, which will be killed at around 21:00 UTC so it doesn't overlap with the normal cron run - T167784
[11:20:03] hi aude
[11:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:10] T167784: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784
[11:20:26] i'd like to populate the wb_terms column on test wikidata now
[11:20:39] (03CR) 10Jcrespo: [C: 032] parserCachePurge: Run it every day, not only on Sunday [puppet] - 10https://gerrit.wikimedia.org/r/367666 (https://phabricator.wikimedia.org/T167784) (owner: 10Jcrespo)
[11:20:44] you're not doing any maintenance that would be an issue for this?
[11:21:20] aude: we are not doing any maintenance on that shard, if you have proper throttling feel free to go ahead :)
[11:21:22] 10Operations, 10monitoring: Replace nrpe 2.15 (& evaluate alternatives) - https://phabricator.wikimedia.org/T157853#3469624 (10MoritzMuehlenhoff) > This task was more than that, and namely getting rid of 2.15 everywhere on the fleet, which we could now easily do with a jessie/trusty backport. Now that compatib...
[11:21:52] (03Abandoned) 10Marostegui: Parsercache: Purge only certain days [puppet] - 10https://gerrit.wikimedia.org/r/367665 (https://phabricator.wikimedia.org/T167784) (owner: 10Marostegui)
[11:21:57] script does batching + wait for slaves / replication etc
[11:22:01] so think it's ok
[11:22:03] sounds good :)
[11:22:15] undefined 'weekday' from '0'
[11:22:17] ?
[11:22:19] ok
[11:22:48] aude: ETA finish?
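The cron bug found above comes down to the last crontab field: leaving the puppet resource's weekday unset was being rendered as `0` (Sunday) instead of `*` (every day), turning the intended daily purge `0 1 * * *` into the weekly `0 1 * * 0`. A simplified sketch (not puppet's or cron's actual implementation; it only handles `*` and a single number) of why those two specs differ:

```python
# Simplified sketch of crontab weekday matching: in the five-field
# format the fifth field is the day of week, where 0 means Sunday and
# "*" means every day. A code path that maps an unset weekday to 0
# instead of "*" silently turns a daily job into a Sunday-only job,
# which is the bug fixed by gerrit change 367666 above.

def runs_on(spec: str, weekday: int) -> bool:
    """True if the cron spec's weekday field matches (0=Sunday..6=Saturday)."""
    dow = spec.split()[4]
    return dow == "*" or int(dow) == weekday

SUNDAY, MONDAY = 0, 1
```

With this, `0 1 * * 0` matches only Sunday while `0 1 * * *` matches every weekday value, which is exactly the difference the fix restores.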
[11:22:58] jynus: it looks good to me now: 0 1 * * * [11:23:10] yes [11:23:31] for some reason, not declaring it is mapped to 0, not undefined [11:24:05] I wonder how many people and commiting the same mistake for what I would expect is the sane parameter handling [11:24:20] yeah, easy mistake to make [11:24:32] I am glad you checked it [11:24:37] on terbium [11:24:39] I don't think it is a mistake, but a bug [11:24:54] well, yes :) [11:25:12] should rebuildEntityPerPage.php run every week or every day, for example [11:26:47] aude: I would suggest adding it to https://wikitech.wikimedia.org/wiki/Deployments [11:26:55] on top if it is long-running [11:27:03] or on a window you reserve [11:27:17] once it is reserved for you, nobody can take it from you :-) [11:28:50] jynus: it's rebuildTermIndex and is only a one time thing [11:29:05] or maybe one off again if we have some need for it [11:29:19] aude: better then, add it to the deployment window [11:29:28] you can create now a new one [11:30:08] jynus: maybe it's done in 20 min [11:31:10] for the big one, wikidata i would suggest to get a window [11:31:17] even if it is 40minutes or something [11:31:28] just literally add a line to that page [11:31:38] yeah for wikidata, we will scheduel [11:31:39] I think it doesn't take much time [11:32:08] and if someone else comes and has a similar task- they will have to wait [11:37:15] jouncebot: next [11:37:15] In 1 hour(s) and 22 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T1300) [11:44:42] trying to save some space due to the extra binlogs: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=pc1005 [11:46:13] (03CR) 10Zfilipin: rake: new rakefile specifically for CI (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: 10Giuseppe Lavagetto) [11:58:28] !log installing binutils 
update from jessie point release [11:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:36] !log testing defragmenting pc2004 - if lag is created, ignore [11:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:16] (03PS3) 10Marostegui: check_private_data.py: Add socket parameter [puppet] - 10https://gerrit.wikimedia.org/r/367650 (https://phabricator.wikimedia.org/T153743) [12:08:18] !log ran rebuildTermSqlIndex.php on test.wikidata [12:08:24] (03CR) 10Marostegui: [C: 032] check_private_data.py: Add socket parameter [puppet] - 10https://gerrit.wikimedia.org/r/367650 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:05] !log upgrade diamond to 4.0.515 in eqiad - T97635 [12:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:13] T97635: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635 [12:09:18] aude: did it finish? [12:10:57] godog: yay [12:11:29] 10Operations, 10monitoring, 10User-fgiunchedi: Diamond log level set to DEBUG spams syslog - https://phabricator.wikimedia.org/T171580#3469704 (10fgiunchedi) [12:11:54] (03PS1) 10DCausse: Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) [12:12:12] 10Operations, 10Diamond, 10Upstream: Diamond load averages do not contain scaled versions - https://phabricator.wikimedia.org/T125411#3469722 (10zhuyifei1999) [12:12:14] 10Operations, 10monitoring, 10User-fgiunchedi: update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3469725 (10zhuyifei1999) [12:13:55] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[scap] [12:17:37] marostegui: yes [12:17:40] looks good [12:18:05] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK [12:18:21] aude: how long did it take? [12:19:14] aude: would you mind updating : https://phabricator.wikimedia.org/T171461 just for the record ? [12:21:48] 35 min [12:21:57] test.wikidata has a lot more properties than wikidata [12:24:35] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [12:27:19] expected, likely new diamond version ^ [12:27:25] zhuyifei1999_: \o/ [12:29:05] PROBLEM - MD RAID on ms-be1016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 [12:29:06] ACKNOWLEDGEMENT - MD RAID on ms-be1016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T171582 [12:29:43] (03CR) 10Gehel: [C: 04-1] Add script.max_compilations_per_minute to elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [12:29:47] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171582#3469759 (10ops-monitoring-bot) [12:30:55] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 175.65, 110.76, 64.60 [12:33:15] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:33:20] 10Operations: Integrate jessie 8.8 point release - https://phabricator.wikimedia.org/T164703#3469771 (10MoritzMuehlenhoff) These are fully rolled out: mongodb unzip systemd guile-2.0 gnutls28 libxslt None of the packages removed for
8.8 were present in our environment. [12:33:27] PROBLEM - Check systemd state on ms-be1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:33:27] PROBLEM - Disk space on ms-be1016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error [12:33:27] ugh 1016 is really unhappy, I'm getting a password prompt when ssh'ing [12:33:55] RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 27.71, 71.23, 57.51 [12:34:28] 10Operations, 10Puppet, 10User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3469775 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:35:10] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171582#3469776 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03Cmjohnson [12:35:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3469779 (10MoritzMuehlenhoff) a:03Cmjohnson [12:36:09] 10Operations, 10monitoring: Replace nrpe 2.15 (& evaluate alternatives) - https://phabricator.wikimedia.org/T157853#3469782 (10faidon) 05Open>03Resolved Yeah, I thought about it some more and I concur. 2.15's "SSL" is a joke, but in our case it doesn't matter much as pretty much everything that we send ove... 
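The "MD RAID … State: degraded" alert on ms-be1016 above is the kind of condition a check derives from /proc/mdstat, where a missing array member shows up as an underscore in the status brackets (e.g. [U_]). A minimal sketch of that parsing; the exact rules here are an assumption for illustration, not the production RAID handler:

```python
import re

def degraded_arrays(mdstat_text):
    """Return md device names whose status brackets show a missing
    member ('_' in e.g. [U_]), the condition behind a degraded alert."""
    bad = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r'^(md\d+)\s*:', line)
        if m:
            current = m.group(1)
        elif current and re.search(r'\[U*_+U*\]', line):
            bad.append(current)
    return bad

sample = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      976630336 blocks super 1.2 [2/1] [U_]
"""
print(degraded_arrays(sample))  # -> ['md0']
```

A fully healthy array ([UU]) produces no match, so an empty list means nothing is degraded.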
[12:37:23] !log powercycle ms-be1016, couldn't get getty output from console [12:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:56] (03CR) 10Hashar: "Havent tested it but at least here is some review :-]" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/366591 (https://phabricator.wikimedia.org/T166888) (owner: 10Giuseppe Lavagetto) [12:40:37] (03CR) 10DCausse: Add script.max_compilations_per_minute to elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [12:40:43] (03PS2) 10DCausse: Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) [12:40:49] 10Operations, 10Puppet, 10Cloud-VPS: Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188#3469789 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:41:06] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:41:25] RECOVERY - swift-container-updater on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:41:25] RECOVERY - Check systemd state on ms-be1016 is OK: OK - running: The system is fully operational [12:41:35] RECOVERY - Disk space on ms-be1016 is OK: DISK OK [12:42:15] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:42:55] jouncebot: next [12:42:55] In 0 hour(s) and 17 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T1300) [12:44:34] !log restarting cassandra on maps clusters [12:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:11] !log enabling mw1260 (jessie-based video scaler) for job processing [12:45:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:46] RECOVERY - Check systemd state on mw1260 is OK: OK - running: The system is fully operational [12:47:46] 10Operations, 10Cloud-VPS, 10monitoring, 10User-fgiunchedi: Diamond collectors collects NFS statistics on Cloud-VPS - https://phabricator.wikimedia.org/T171583#3469801 (10zhuyifei1999) [12:48:03] 10Operations, 10Cloud-VPS, 10monitoring, 10User-fgiunchedi: Diamond collectors collects NFS statistics on Cloud-VPS - https://phabricator.wikimedia.org/T171583#3469801 (10zhuyifei1999) [12:50:45] PROBLEM - kartotherian endpoints health on maps2003 is CRITICAL: /{src}/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200) [12:50:50] PROBLEM - kartotherian endpoints health on maps2004 is CRITICAL: /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: 200): /{src}/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200) [12:51:04] kartotherian is probably me, checking [12:51:16] PROBLEM - kartotherian endpoints health on maps2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200) [12:51:40] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [12:51:55] RECOVERY - kartotherian endpoints health on maps2004 
is OK: All endpoints are healthy [12:52:39] <_joe_> gehel: that you I guess? [12:52:40] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy [12:52:52] yep, that's me, recovering... [12:53:26] RECOVERY - kartotherian endpoints health on maps2002 is OK: All endpoints are healthy [12:53:27] jouncebot: next [12:53:27] In 0 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T1300) [12:53:40] probably not enough delay between the restart of each cassandra node... they did look like they were fully started... [12:53:49] hasharLunch: should I get ready for swat, or do you insist? ;) [12:53:55] RECOVERY - kartotherian endpoints health on maps2003 is OK: All endpoints are healthy [12:53:58] * gehel is probably missing something. Checking logs to understand... [12:54:01] zeljkof: I can process them [12:54:06] they seem straightforward [12:54:23] hashar: ok, go ahead then, and have fun :) [12:54:42] 10Operations, 10monitoring, 10User-fgiunchedi: Update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3469823 (10fgiunchedi) Applying https://github.com/Ssawa/Diamond/commit/8b58d7a7dd2a1249731b0642b35e7d7cbdcf611f from the github issue fixes it and stop is fast again. The patch... 
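The kartotherian flap above came from restarting cassandra nodes without waiting long enough between them. The delay-and-verify loop gehel describes can be sketched as follows; `restart` and `healthy` are injected callables standing in for the real tooling (hypothetical, for illustration only):

```python
import time

def rolling_restart(nodes, restart, healthy, poll=5, timeout=300):
    """Restart nodes one at a time, proceeding only once the node
    reports healthy again -- the delay that was likely missing above."""
    for node in nodes:
        restart(node)
        deadline = time.monotonic() + timeout
        while not healthy(node):
            if time.monotonic() > deadline:
                raise TimeoutError(f"{node} still unhealthy after restart")
            time.sleep(poll)

# Fakes standing in for real restart/health-check commands:
log = []
checks = {"maps2001": iter([False, True]), "maps2002": iter([True])}
rolling_restart(["maps2001", "maps2002"],
                restart=lambda n: log.append(n),
                healthy=lambda n: next(checks[n]),
                poll=0, timeout=10)
print(log)  # -> ['maps2001', 'maps2002']
```

The point of the timeout is that a node which never comes back stops the rollout instead of letting the loop take down the next replica too.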
[12:54:55] PROBLEM - kartotherian endpoints health on maps2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200) [12:56:53] !log rolling restart of aqs* for jvm upgrades [12:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:55] RECOVERY - kartotherian endpoints health on maps2004 is OK: All endpoints are healthy [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T1300). Please do the needful. [13:00:04] TabbyCat and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:16] meow o/ [13:00:37] (03PS2) 10DCausse: [WIP] Bump ltr plugin to include logging features [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/364462 [13:00:39] (03PS4) 10Hashar: High density logos for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366875 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:01:23] (03PS3) 10Hashar: High density logos for es.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367008 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:01:37] TabbyCat: hello. 
I am going to deploy both changes in one go [13:01:41] TabbyCat: and purge the logos [13:01:53] hashar: hello, it looks correct to me [13:02:02] yes :-} [13:02:21] go ahead :) [13:04:10] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367008 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:04:15] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366875 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:05:05] 10Operations, 10LDAP-Access-Requests, 10Wikidata-Sprint: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3469835 (10MoritzMuehlenhoff) @christophneuroth : Ok, let's wait for Rachel to confirm on this task, then I'll enable your access. [13:05:27] !log restarting elastic relforge100x servers to pick up new version of the ltr plugin [13:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:02] (03Merged) 10jenkins-bot: High density logos for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366875 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:06:11] (03CR) 10jenkins-bot: High density logos for es.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366875 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:06:50] (03Merged) 10jenkins-bot: High density logos for es.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367008 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:07:45] !log hashar@tin Synchronized static/images/project-logos: High density logos for es.wikiquote - T170604 (duration: 00m 46s) [13:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:57] T170604: High density logos for spanish sister projects - https://phabricator.wikimedia.org/T170604 [13:08:24] (03CR) 10jenkins-bot: High density logos for es.wikisource
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/367008 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [13:08:25] 10Operations, 10ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3469848 (10Jgreen) [13:08:48] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: High density logos for es.wikiquote - T170604 (duration: 00m 49s) [13:08:51] 10Operations, 10Cloud-VPS, 10monitoring, 10User-fgiunchedi: Diamond collectors collects NFS statistics on Cloud-VPS - https://phabricator.wikimedia.org/T171583#3469861 (10zhuyifei1999) Workaround (on render side): set a threshold of 1TB `aliasByNode(maximumBelow(maximumAbove(HOST.diskspace.*.byte_avail,0),... [13:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:23] (03PS1) 10Faidon Liambotis: Use Python yaml.safe_load everywhere [puppet] - 10https://gerrit.wikimedia.org/r/367671 [13:10:18] !log hashar@tin Synchronized static/images/project-logos: High density logos for es.wikisource - T170604 (duration: 00m 44s) [13:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:46] TabbyCat: doing the last sync and I will purge the logos [13:11:20] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: High density logos for es.wikisource - T170604 (duration: 00m 43s) [13:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:32] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/367671 (owner: 10Faidon Liambotis) [13:12:52] hashar: okay, I'm messaging the projects to let them know [13:13:00] !log Purged project-logos for eswikisource/eswikiquote high density logos T170604 : find static/images/project-logos -maxdepth 1 -type f| sed -e 's%^%https://en.wikipedia.org/%' [13:13:06] so if they see anything strange I can take a look [13:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:09] 
T170604: High density logos for spanish sister projects - https://phabricator.wikimedia.org/T170604 [13:14:06] so far everything looks good to me [13:14:16] (03CR) 10Hashar: "I am holding this change until I or Zeljko get to reach Lydia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [13:14:53] zeljkof: I will have to leave in ~45 minutes. https://gerrit.wikimedia.org/r/#/c/366866/ requires syncing with Lydia because there is an announcement to be made to the wikidata community [13:15:03] !log rebooting achernar for kernel update [13:15:05] zeljkof: so if I am no longer around, I guess they will poke you [13:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:25] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:15:25] PROBLEM - eventlogging-service-eventbus endpoints health on kafka2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:15:25] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:15:37] hashar: I should be around for a couple more hours [13:15:41] great [13:16:13] elukey: eventbus not happy about achernar's reboot?
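The purge logged above builds its URL list with `find static/images/project-logos -maxdepth 1 -type f | sed -e 's%^%https://en.wikipedia.org/%'`. An equivalent sketch; the en.wikipedia.org prefix comes from the log, the function name is illustrative:

```python
from pathlib import Path

def purge_urls(root, prefix="https://en.wikipedia.org/"):
    """Mirror the logged find|sed one-liner: list regular files
    directly under `root` (non-recursive) and prepend the purge prefix."""
    return sorted(prefix + p.as_posix()
                  for p in Path(root).iterdir() if p.is_file())
```

Like `-maxdepth 1 -type f`, this skips subdirectories and only picks up files at the top level of the logo directory.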
[13:16:25] RECOVERY - eventlogging-service-eventbus endpoints health on kafka2003 is OK: All endpoints are healthy [13:17:00] * elukey cries [13:17:18] going to check in 1 min [13:18:33] achernar is back up already, nice [13:18:47] (03PS3) 10Gehel: Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:19:56] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [13:20:15] ema: something is wrong, I can still see tons of DNS queries on kafka2003 [13:20:22] I suspect that eventbus wasn't restarted [13:20:24] 10Operations, 10Cloud-VPS, 10monitoring, 10User-fgiunchedi: Diamond collectors collects NFS statistics on Cloud-VPS - https://phabricator.wikimedia.org/T171583#3469876 (10zhuyifei1999) Created [[https://github.com/wikimedia/nagf/pull/16|PR to nagf]]. [13:20:27] and it didn't pick up the new settings [13:20:32] elukey: oh [13:20:33] ! [13:20:35] :) [13:20:42] (03CR) 10DCausse: Add script.max_compilations_per_minute to elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:20:56] hmm it def looks like it wasn't restarted there. [13:21:04] but why not?! scap sure made it look like it was [13:21:06] checking others...
[13:21:14] elukey, ottomata: we can doublecheck with the reboot of chromium/hydrogen (after we've verified that eventbus was restarted in eqiad) [13:21:28] (03CR) 10Faidon Liambotis: [C: 031] Clean up stray binary packages after Debian updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/367645 (owner: 10Muehlenhoff) [13:21:45] ok that explains why the other kafkas in codfw were fine [13:21:49] (03PS4) 10Gehel: Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:22:06] ema: confirmed, just depool/restart/pool kafka2003, no more dns lookups [13:22:08] (03CR) 10Faidon Liambotis: [C: 032] Use Python yaml.safe_load everywhere [puppet] - 10https://gerrit.wikimedia.org/r/367671 (owner: 10Faidon Liambotis) [13:22:10] it looks like it was just kafka2003 [13:22:15] alright [13:22:21] elukey: i did have some trouble with scap yesterday [13:22:24] (03CR) 10DCausse: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:22:28] yep I saw them ottomata ! [13:22:29] and had to depool, deploy, pool one by one [13:22:58] elukey: moritzm is about to reboot the eqiad resolvers, how about kafka1*? [13:23:12] I am checking :) [13:23:26] awesome :) [13:23:36] (03CR) 10Muehlenhoff: Clean up stray binary packages after Debian updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/367645 (owner: 10Muehlenhoff) [13:23:42] kafka1001 is not happy [13:23:49] ? [13:24:00] still making a lot of queries [13:24:17] hm [13:24:24] shall i depool restart? [13:24:25] it?
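The depool/restart/pool sequence applied to kafka2003 above can be sketched as a guarded helper. The command names (`depool`, `pool`, the systemctl call) stand in for the real pooling tooling, and the runner is injected so nothing here actually shells out:

```python
def depool_restart_pool(host, service, run):
    """Take `host` out of rotation, restart `service`, and repool,
    ensuring the repool happens even if the restart fails.
    `run` executes a command list (injected; hypothetical commands)."""
    run(["depool", host])
    try:
        run(["systemctl", "restart", service])
    finally:
        run(["pool", host])

calls = []
depool_restart_pool("kafka2003", "eventbus", run=calls.append)
print([c[0] for c in calls])  # -> ['depool', 'systemctl', 'pool']
```

The try/finally matters: a host left depooled after a failed restart is still better than one repooled while broken, but forgetting the repool on success is the more common mistake this guards against.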
[13:24:27] I'll wait with chromium/hydrogen reboots until the kafka* situation is sorted (and need to wait for achernar's NTP to be fully synced as well) [13:24:39] ottomata: doing it now :) [13:24:42] ok [13:24:45] yayay [13:24:45] yall on it [13:24:46] repooling achernar for recdns [13:24:56] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [13:25:58] all good now on kafka1001, just repooled [13:26:06] moritzm: eventbus eqiad should be good [13:27:47] (03PS5) 10Gehel: Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:32:00] elukey: proceeding with hydrogen (eqiad recdns), then? [13:32:01] (03CR) 10DCausse: [C: 031] Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:32:17] moritzm: green light from me [13:32:58] !log rebooting hydrogen for kernel update [13:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:14] urandom: hello! https://phabricator.wikimedia.org/P5798 looks weird, am I doing something wrong? 
[13:33:25] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [13:33:25] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:33:25] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (ex [13:33:25] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [13:33:25] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:33:35] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:33:45] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned 
the unexpected status 400 (expecting: 200) [13:33:45] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:33:45] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 40 [13:33:45] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:33:45] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) timed out before a response was received: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [13:33:45] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received [13:33:45] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of 
en.wp Altrincham page via mobile-sections-lead) timed out before a response was received [13:33:46] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [13:33:46] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/most-read/{yyyy}/{mm}/{dd} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [13:33:47] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [13:33:47] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) timed out before a response was received [13:33:48] was just going to say I was getting some 503s.. [13:33:48] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [13:33:59] echo is also broken... 
[13:34:05] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received [13:34:06] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:34:06] PROBLEM - restbase endpoints health on cerium is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a g [13:34:06] is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:34:15] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [13:34:15] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (ex [13:34:15] wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received [13:34:15] 
PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received [13:34:15] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) timed out before a response was received: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [13:34:15] PROBLEM - puppet last run on wtp1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:34:15] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [13:34:16] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (ex [13:34:25] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:34:25] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [13:34:26] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [13:34:35] PROBLEM - restbase 
endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:34:35] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle = None: /en.wikipedia.org/v1/page/revision/{revision} (Get rev by ID) timed out before a response was received [13:34:44] is this hydrogen ? [13:34:45] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [13:34:45] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [13:34:46] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) [13:35:07] RECOVERY - restbase endpoints health on cerium is OK: All endpoints are healthy [13:35:07] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [13:35:07] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [13:35:22] elukey: not sure, hydrogen is back up, but several of the recoveries appeared while it was still booting [13:35:22] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [13:35:22] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [13:35:22] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [13:35:22] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [13:35:25] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are 
healthy [13:35:25] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [13:35:35] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [13:35:35] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [13:35:35] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [13:35:45] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [13:35:45] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [13:35:45] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [13:35:45] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [13:35:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [13:35:45] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [13:35:45] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [13:35:46] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [13:35:46] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [13:35:47] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [13:35:47] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [13:35:48] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [13:36:05] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [13:36:15] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [13:36:23] looking like we're back (at least with what I was doing) [13:37:06] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] 
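The "HTTP 5xx reqs/min" checks above report the share of recent graphite datapoints above a fixed threshold (e.g. 44.44% above 1000). A minimal sketch of that computation; treating nulls as absent is an assumption about how the real check behaves:

```python
def pct_above(points, threshold):
    """Percentage of non-null datapoints strictly above `threshold`."""
    pts = [p for p in points if p is not None]
    if not pts:
        return 0.0
    return 100.0 * sum(p > threshold for p in pts) / len(pts)

# 4 of 9 datapoints above 1000 -> 44.44%, matching the alert format above.
print(round(pct_above([1200, 800, 1500, 900, 1100, 500, 600, 700, 2000], 1000), 2))  # -> 44.44
```

With a window of nine datapoints, each one above the threshold moves the reported figure by about 11 points, which is why these percentages land on values like 44.44% and 55.56%.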
[13:37:15] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:37:18] 50K 503 per minute [13:37:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [13:37:55] I was restarting AQS nodes but I don't think that this mess could be caused by that, plus I didn't see any impact on the hosts [13:38:05] checking to be sure [13:38:19] ACKNOWLEDGEMENT - HP RAID on ms-be1016 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T171585 [13:38:23] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171585#3469939 (10ops-monitoring-bot) [13:38:35] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:38:37] moritzm: please don't repool hydrogen if it's still depooled [13:38:52] looks like the last minutes are showing 0 fatals [13:39:00] ema: depooling again [13:39:07] moritzm: no wait [13:39:15] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [13:39:20] moritzm: I thought you hadn't repooled yet [13:39:21] last one was at 13:35 UTC [13:39:26] ema: already done, shall I repool? [13:39:51] moritzm: nope, wait just a second [13:39:58] k [13:40:31] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3469948 (10fgiunchedi) [13:41:14] zeljkof: so the wikidata patch will be for next monday. 
The point of contacts are attending a conference this week [13:41:17] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T167264#3322381 (10fgiunchedi) [13:41:22] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171582#3469957 (10fgiunchedi) [13:41:24] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3469958 (10fgiunchedi) [13:41:26] hashar: ok [13:41:26] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171585#3469959 (10fgiunchedi) [13:42:18] moritzm: feel free to repool it [13:42:34] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3456811 (10fgiunchedi) 05duplicate>03Open [13:42:34] k, done [13:43:11] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3469982 (10fgiunchedi) [13:43:27] sorry about the phab spam [13:43:45] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:43:47] (03CR) 10Hashar: "This change got scheduled for today SWAT slot by Amir1 however Lydia / Auregann are not available to do the announcement." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366866 (https://phabricator.wikimedia.org/T169060) (owner: 10Daniel Kinzler) [13:44:52] what were the services impacted? I can see graphoid, mobileapps and reccomendation? 
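The "N% of data above the critical threshold [1000.0]" alerts above come from a Graphite-backed Icinga check that classifies a window of 5xx-per-minute samples by what fraction of datapoints exceed a limit. A minimal sketch of that logic, with names and the 30% trigger fraction assumed for illustration (the production check's exact semantics may differ):

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

def alert_state(datapoints, warn=250.0, crit=1000.0, trigger_pct=30.0):
    """Classify a window of samples, loosely mirroring the
    'N% of data above the critical threshold [C]' output seen in the log."""
    if percent_above(datapoints, crit) >= trigger_pct:
        return "CRITICAL"
    if percent_above(datapoints, warn) >= trigger_pct:
        return "WARNING"
    return "OK"
```

Null datapoints are excluded before computing the fraction, which matches why sparse Graphite series can flap between states.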
[13:45:15] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:47:46] elukey: api, parsoid and logstash judging from lvs1003's logs [13:48:15] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:48:21] (03PS7) 10Muehlenhoff: Clean up stray binary packages after Debian updates [puppet] - 10https://gerrit.wikimedia.org/r/367645 [13:48:31] (03PS6) 10Gehel: Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:48:35] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:48:56] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:49:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:50:21] (03CR) 10Gehel: [C: 032] Add script.max_compilations_per_minute to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/367668 (https://phabricator.wikimedia.org/T171579) (owner: 10DCausse) [13:50:50] 10Operations, 10Traffic: Implement machine-local forwarding DNS caches - https://phabricator.wikimedia.org/T171498#3469995 (10BBlack) >>! In T171498#3468712, @faidon wrote: > - I'm worried a little bit that this will hide issues like the ones you mentioned under the carpet. The cases where services are latency... [13:51:12] elukey: it looks like cassandra isn't coming back up, can you confirm whether or not that is true? [13:51:27] elukey: is that instance running after it exits with the exception? [13:52:31] urandom: it went up correctly [13:53:04] ok, so the scripts detection of that failed, i wonder why? [13:53:08] (03PS1) 10MarcoAurelio: HD logos for eswikivoyage and added some paths. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/367676 (https://phabricator.wikimedia.org/T170604) [13:53:18] elukey: it attempts to connect to port 9042 [13:54:10] (03PS2) 10MarcoAurelio: HD logos for eswikivoyage and added some missing paths to the config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367676 (https://phabricator.wikimedia.org/T170604) [13:55:50] _joe_: we're ready for restbase2002 & 1015 whenever you are [13:56:20] <_joe_> urandom: ok, give me 10 minutes and we can start with 2002 [13:56:29] _joe_: sure [13:57:39] urandom: maybe it doesn't come up in time? [13:58:14] I can try with --attempts 30 [13:58:23] ¯\_(ツ)_/¯ [13:58:36] trying with aqs1006 now :) [13:58:41] seems like it should have come up in that amount of time, but it's probably worth trying [13:59:37] (03PS1) 10Herron: Lists: Change exim filter for spam observed from qq.com [puppet] - 10https://gerrit.wikimedia.org/r/367677 (https://phabricator.wikimedia.org/T170601) [13:59:37] elukey: it is the time it takes for the cql port to open up, which happens last, after having joined the ring and reading commitlogs, etc [13:59:55] urandom: I blame the script author! :P [14:00:03] elukey: though now that i think of it, c-foreach-restart does a drain, so the commitlogs should be zero length [14:00:12] elukey: that seems reasonable, yeah :) [14:00:19] elukey: i blame him too [14:00:50] (03CR) 10Herron: [C: 032] Lists: Change exim filter for spam observed from qq.com [puppet] - 10https://gerrit.wikimedia.org/r/367677 (https://phabricator.wikimedia.org/T170601) (owner: 10Herron) [14:01:45] PROBLEM - puppet last run on fermium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:01:48] (03PS1) 10Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 [14:01:51] <_joe_> urandom: what was the ticket again? [14:02:17] urandom: worked! 
[14:02:21] (03CR) 10jerkins-bot: [V: 04-1] Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 (owner: 10Giuseppe Lavagetto) [14:02:26] RECOVERY - puppet last run on wtp1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:02:33] _joe_: looking [14:02:51] _joe_: https://phabricator.wikimedia.org/T162735 [14:03:12] <_joe_> oh it's an hp ilo [14:03:21] elukey: that is odd, how many times did it log "not listening, will retry.." ? [14:03:23] <_joe_> so I have to discover how to handle the bios :P [14:03:49] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:04:09] 10Operations, 10Cloud-VPS, 10monitoring, 10User-fgiunchedi: Diamond collectors collects NFS statistics on Cloud-VPS - https://phabricator.wikimedia.org/T171583#3470079 (10zhuyifei1999) This is probably caused by [[https://github.com/python-diamond/Diamond/commit/1d85b93defecc338cbf0e9a7dee371204c8311dd|1d8... [14:04:54] !log restarting elastic on relforge100x servers to test new config [14:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:35] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:07:26] 10Operations, 10Ops-Access-Requests: Requesting access to RESOURCE for fajr18 - https://phabricator.wikimedia.org/T171591#3470096 (10Fajr18) [14:08:53] ##1Cerveza [14:09:00] uhoh [14:09:05] cmjohnson1: sounds like a password :) [14:09:05] cmjohnson1: Bit early to drink? ;) [14:10:13] <_joe_> urandom: did you already depool restbase2002? [14:10:48] _joe_: no [14:11:04] have we been doing that?
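The restart-script discussion above (the instance was up but the script's detection failed, and elukey retried with `--attempts 30`) describes polling the CQL port until it opens. A sketch of that kind of readiness probe; the function name and defaults are mine, not the real script's:

```python
import socket
import time

def wait_for_port(host, port, attempts=10, delay=1.0, timeout=2.0):
    """Poll a TCP port until it accepts connections, or give up.

    Cassandra opens the CQL port (9042) last, after joining the ring and
    replaying commitlogs, so a short attempt budget can report failure
    even though the node is coming up fine.
    """
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            if attempt < attempts:
                time.sleep(delay)
    return False
```

Raising the attempt budget (as with `--attempts 30`) only widens the window; the underlying race between "port open" and "node healthy" remains.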
[14:11:11] [14:11:22] i haven't [14:11:43] yep..not a password to anything...other than my macbook...will change that now [14:12:16] <_joe_> urandom: we should, this server is going to be down for some time [14:12:32] <_joe_> I'm handling it [14:12:58] 10Operations, 10Ops-Access-Requests: Requesting access to RESOURCE for fajr18 - https://phabricator.wikimedia.org/T171591#3470149 (10Fajr18) Hi, I am a former user of toolserver. I want to run my bot {it has flag} and I need these resources as an alternative to toolserver. Regards [14:14:34] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2002.codfw.wmnet [14:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:44] <_joe_> urandom: ok to shut the server down? [14:17:40] <_joe_> should we hand-stop cassandra? [14:18:00] _joe_: umm, i can drain (not totally necessary) [14:18:01] 2002? [14:18:13] <_joe_> urandom: yeah, it's ok not to drain it [14:18:20] gimme a sec [14:18:26] <_joe_> I hope to be done soon, but hardware, you never know [14:22:38] <_joe_> urandom: can I reboot the server? 
[14:22:59] should have kicked this off in parallel; it's almost there [14:23:24] _joe_: {{done}} [14:23:30] you may fire when ready [14:23:31] <_joe_> ok, rebooting [14:23:32] <_joe_> !log shutting down restbase2002, T162735 [14:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:43] T162735: Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735 [14:26:56] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2141401 [14:29:52] <_joe_> !log enabled hyperthreading on restbase2002.codfw.wmnet T162735, rebooting the server [14:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:01] T162735: Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735 [14:34:19] <_joe_> urandom: restbase2002 is back up and with hyperthreading enabled [14:34:29] <_joe_> can you ping me when I can act on restbase1015? 
[14:34:40] <_joe_> I'm brewing myself a coffee in the meanwhile [14:34:56] _joe_: yup, i'll start the drain [14:35:15] win 61 [14:35:21] <_joe_> un-do the drain on rb2002 too [14:35:42] _joe_: the restart did that [14:35:46] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2002.codfw.wmnet [14:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:58] <_joe_> ok, bbiab [14:36:42] !log draining restbase1015.eqiad.wmnet T162735 [14:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:53] T162735: Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735 [14:38:25] PROBLEM - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused [14:38:47] let me maintenance that [14:41:20] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.138 and port 9042: Connection refused eevans BIOS change (T162735) [14:41:20] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.139 and port 9042: Connection refused eevans BIOS change (T162735) [14:41:20] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.48.140:9042 on restbase1015 is CRITICAL: connect to address 10.64.48.140 and port 9042: Connection refused eevans BIOS change (T162735) [14:41:39] _joe_: drained; good to go [14:41:46] today we have peaked to 100K queries per second only on english wikipedia [14:51:01] !log oblivian@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1015.eqiad.wmnet [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:24] 10Operations, 10ops-eqiad, 10Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3470350 (10Cmjohnson) @ema is it okay to take this down..most of the time the 
server needs a re-install after swapping /dev/sda will this be okay? [14:52:31] <_joe_> !log shutting down restbase1015, T162735 [14:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:42] T162735: Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735 [14:54:28] 10Operations, 10ops-eqiad, 10Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3470368 (10ema) >>! In T171028#3470350, @Cmjohnson wrote: > @ema is it okay to take this down..most of the time the server needs a re-install after swapping /dev/sda will this be okay? @Cmjohnson: yes.... [14:55:24] where's a good place to report Phab spam? https://phabricator.wikimedia.org/T171596 [14:57:28] _joe_: i'm about to (virtually) step into a meeting, but i'll keep an eye here in case of issue [14:57:42] <_joe_> urandom: rebooting now [14:57:49] <_joe_> I'll ping you in case of need [14:57:55] kk [14:58:29] <_joe_> !log enabled hyperthreading on restbase1015.eqiad.wmnet T162735, rebooting the server [14:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:39] T162735: Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735 [15:00:13] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3470414 (10Fjalapeno) @fgiunchedi thanks for the info… I'm worki... 
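The restbase2002 maintenance above follows a depool → drain → reboot → repool pattern visible in the conftool !log entries. A sketch of that sequence with the command runner injected so it can be dry-run; the `confctl`/`nodetool` invocations are modeled on the log, while the wrapper function itself is hypothetical:

```python
def ht_maintenance(fqdn, run):
    """Run the depool/drain/reboot/repool dance for one node.

    `run` is a callable taking an argv list (subprocess.check_call in
    real use, or a recorder in tests).
    """
    selector = "name=%s" % fqdn
    run(["confctl", "select", selector, "set/pooled=no"])   # take out of LVS
    run(["nodetool", "drain"])                              # flush memtables, stop accepting writes
    run(["reboot"])                                         # BIOS change happens across this boundary
    run(["confctl", "select", selector, "set/pooled=yes"])  # back into rotation
    return selector
```

As urandom notes in the log, the drain is not strictly necessary before a shutdown, but it leaves the commitlogs empty and so shortens the restart.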
[15:03:35] RECOVERY - cassandra-a CQL 10.64.48.138:9042 on restbase1015 is OK: TCP OK - 0.000 second response time on 10.64.48.138 port 9042 [15:03:52] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1015.eqiad.wmnet [15:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:50] _joe_: thank you! [15:06:04] <_joe_> urandom: all ok afaict [15:06:27] 10Operations, 10DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3470470 (10Joe) [15:06:29] 10Operations, 10Cassandra, 10Services (blocked), 10User-Joe, 10User-fgiunchedi: Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735#3470468 (10Joe) 05Open>03Resolved a:03Joe [15:06:44] _joe_: yeah, looks good! [15:07:36] PROBLEM - Check systemd state on relforge1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:10:44] gehel: ^ fyi [15:11:00] chasemp: thanks! dcausse ^ [15:11:04] dcausse: I'll disable alerting [15:11:15] oops [15:11:25] sorry [15:11:46] dcausse: how long do you need? [15:12:00] several hours, 10 hours should be fine [15:12:25] dcausse: disabled until tomorrow, ping me when done and I'll re-enable [15:12:32] gehel: thanks [15:20:58] (03CR) 10BryanDavis: [C: 031] "Should be scheduled for a SWAT window." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369 (owner: 10MarcoAurelio) [15:21:35] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2097194 [15:24:40] (03PS1) 10Ottomata: Burrow should monitor eventlogging_consumer_mysql_eventbus_00 [puppet] - 10https://gerrit.wikimedia.org/r/367684 [15:25:12] elukey: also ^ [15:25:15] +1 [15:25:24] (03CR) 10Ottomata: [V: 032 C: 032] Burrow should monitor eventlogging_consumer_mysql_eventbus_00 [puppet] - 10https://gerrit.wikimedia.org/r/367684 (owner: 10Ottomata) [15:27:45] 10Operations, 10LDAP-Access-Requests, 10Wikidata-Sprint: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3435056 (10RStallman-legalteam) Yes, NDA has been fully executed and is on file. Thank you! [15:31:40] !log updating firmware lvs1007 T167299 [15:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:50] T167299: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299 [15:32:19] bblack ^ [15:33:26] cmjohnson1: \o/ [15:44:21] 10Operations, 10ops-codfw: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3470570 (10Papaul) p:05Triage>03Normal [15:46:35] (03PS1) 10Filippo Giunchedi: thumbor: fix connections-per-backend in nginx [puppet] - 10https://gerrit.wikimedia.org/r/367687 (https://phabricator.wikimedia.org/T171468) [15:47:42] godog: hey can i take over ms-be2024? [15:48:15] PROBLEM - nova-compute process on labvirt1011 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [15:49:15] RECOVERY - nova-compute process on labvirt1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [15:49:20] papaul: hey, yes please! 
I don't know what kind of reanimation it'll need [15:49:26] !log installing imagemagick security updates on trusty hosts (jessie already fixed) [15:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:05] godog: had the same problem with elastic2020 so we replaced the main baord [15:51:12] (03PS3) 10Giuseppe Lavagetto: wmflib: fix all Hiera backends' Rubocop infractions [puppet] - 10https://gerrit.wikimedia.org/r/359447 (owner: 10Faidon Liambotis) [15:55:31] papaul: ugh, ok! let me know how the debugging goes [15:56:10] godog: ok [15:57:31] 10Operations, 10ops-codfw: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3459460 (10RobH) @Papaul: Please drain power via power cable removal. Once that is done, please plug in the server, ensure the power is off, and then attempt to power it on via ilom commands in SSH (not via the p... [16:00:05] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T1600). [16:00:05] Smalyshev and Amir1: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:12] around [16:00:32] Amir1: taking a look [16:01:22] Thanks! 
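The labvirt1011 flap above ("PROCS CRITICAL: 2 processes with regex args ..." followed by a recovery at 1) is the classic check_procs false positive that the later "try to fix an incorrect check_procs alert" patch addresses. A toy version of the counting step; the helper and its interface are illustrative, not Icinga's actual plugin code:

```python
import re

def count_matching(cmdlines, pattern):
    """Count command lines matching `pattern` (regex search, in the
    spirit of check_procs --ereg-argument-array).

    An over-broad pattern can briefly match both an exiting daemon and
    its freshly forked replacement, yielding a transient count of 2.
    """
    rx = re.compile(pattern)
    return sum(1 for cmdline in cmdlines if rx.search(cmdline))
```

Anchoring the pattern (as the `^/usr/bin/python /usr/bin/nova-compute` check does) avoids matching the monitoring process itself, but not a restart race.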
[16:02:23] (03PS2) 10Filippo Giunchedi: mediawiki: increase the maximum time of dispatchChanges cronjob [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [16:02:57] S.Malyshev patch is already merged so noop [16:03:06] <_joe_> godog: uhm hold on a sec [16:03:28] <_joe_> -1 from me on that change Amir1 [16:03:46] <_joe_> we did reduce that value in the past as those jobs were piling up one over each other [16:04:04] <_joe_> either we reduce frequency, or I'm against changing that timeout [16:04:58] <_joe_> the comment about '4 concurrent instances' is somewhat inaccurate [16:05:22] It's a little hard to understand [16:05:36] <_joe_> what is hard to understand? [16:05:41] <_joe_> I can try to explain [16:05:47] <_joe_> sorry, brb [16:05:49] your comments [16:05:51] kk [16:06:05] 10Operations, 10ops-codfw: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3470658 (10Papaul) @RobH we did what you mentioned on T149006 the server came up and after a couple days we went back to the same problem [16:06:37] papaul: oh, I wasn't aware that ms-be2024 had this issue already [16:06:49] have we tried updating the firmware to the latest version before calling for support? [16:07:08] otherwise we'll have to look at either replacing the mainboard or the ilom card (or both) [16:07:37] oh, that's a different server [16:07:46] <_joe_> Amir1: so first thing is - I'd like an opinion from our DBAs [16:08:12] <_joe_> marostegui, is it ok for you to raise the load of syncing wikidata by 25% [16:08:15] <_joe_> ? [16:08:21] <_joe_> on average, I mean [16:08:40] <_joe_> Amir1: as per my comments - timeouts on those scripts are unreliable [16:08:47] <_joe_> they do pile up from time to time [16:09:16] <_joe_> I'll add - if this is the only sync mechanism that doesn't rely on the jobqueue and is also the one lagging so much - the two things might be related?
[16:09:28] robh: yes all the steps that you mentioned we did that with the elastic2020 just wanted to mentioned that but i am working on doing the same steps again since it is the same error we got with elastic202 and we spend 1 month fixing it [16:09:43] papaul: yes but ive had that error on dozens of systems [16:09:53] and elastic2020 had multiple failed parts =] [16:09:59] <_joe_> it would be much much easier for us to devolve resources to syncing wikidata via the traditional method of doing async changes we currently have - unless that was basically done and now we use that script to enqueue jobs [16:10:06] <_joe_> I don't remember if that happened [16:10:13] robh: I agree just wanted to have a reference [16:10:23] cool, good to know about [16:10:38] <_joe_> Amir1: at the very least, we need a sign-off from the dbas [16:10:56] 10Operations, 10ops-codfw: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3470665 (10RobH) That is a different server (elastic2020) than this one though That system had a number of failed parts. That failure can be caused by bad firmware state, simple power reset required, or failed ha... [16:10:59] <_joe_> let's talk with them tomorrow morning and I'll merge the change if they're ok with a 25% increase in that load [16:11:59] _joe_: what about https://gerrit.wikimedia.org/r/#/c/364148/ today? :) :) :) [16:12:04] My laptop just died. I'll be back in a sec [16:12:16] 10Operations, 10LDAP-Access-Requests, 10Wikidata-Sprint: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3470667 (10MoritzMuehlenhoff) @christophneuroth : Ready to add you then, can you please send me your @wikimedia.de email address? 
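_joe_'s objection in the dispatchChanges thread is that a longer cron timeout lets runs pile up on top of each other. One standard guard against that is a non-blocking lock file, so a new run exits immediately if the previous one is still going; this is a generic sketch of the pattern, not what the Puppet-managed cron actually does:

```python
import fcntl
import os

def try_lock(path):
    """Open and flock a pidfile; return the fd if this run got the lock,
    or None if another run still holds it (in which case the caller
    should exit instead of piling up behind it)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    os.write(fd, str(os.getpid()).encode())
    return fd
```

Because `flock` locks belong to the open file description, the lock is released automatically if the job crashes, so a stale lock file cannot wedge the cron.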
<_joe_> greg-g: my "today" is almost over, maybe tomorrow :P [16:12:33] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/367658 (owner: 10Giuseppe Lavagetto) [16:12:43] _joe_: fiiiiine :) [16:13:00] * greg-g meant during puppetswat, just to be clear [16:13:07] <_joe_> ot [16:13:12] <_joe_> it's my change :P [16:13:34] <_joe_> I can merge it whenever, puppetswat is meant to give people without +2 a moment to get ops attention for simple changes [16:14:19] yeah, was just a friendly way to poke you :) [16:15:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "this value was lowered in the past as higher values would create operational issues on terbium. Also, we would be raising the dispatch rat" [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [16:16:39] !log about to delete orphan files on einsteinium T149557 [16:16:49] (03PS2) 10Giuseppe Lavagetto: apt::repository: fix for future parser [puppet] - 10https://gerrit.wikimedia.org/r/367658 [16:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:50] T149557: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557 [16:17:18] <_joe_> greg-g: yeah got it :) Issue is, such a change requires me to watch over hhvm in beta for some time, something I won't be doing at this time of the day, given I have a meeting in ~ 40 [16:18:20] true, I forgot how late puppetswat is in your day [16:19:11] (03CR) 10Giuseppe Lavagetto: [C: 032] apt::repository: fix for future parser [puppet] - 10https://gerrit.wikimedia.org/r/367658 (owner: 10Giuseppe Lavagetto) [16:19:24] _joe_: It's not putting too much pressure on database as it relies on redis now for lock managing [16:20:16] <_joe_> Amir1: I'd still want an all clear from dbas for a change that will increase the update volume by 25% [16:20:35] regarding the timeout being unreliable,
well that's a bug that needs to be fixed, we refactored the code in the past couple of months but that definitely needs to be looked at [16:20:41] _joe_: fine for me :) [16:21:12] <_joe_> Amir1: I'd like you people to spend time in making this go through the jobqueue [16:21:26] <_joe_> but we can talk about this in another moment, I'm rather busy, sorry [16:22:04] 10Operations, 10ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3469848 (10RobH) Unfortunately, the warranty for this particular machine just expired on July 10th. So we'll have to replace the disk with on-site spares from our own pool, not in warranty replacement. I've a... [16:23:02] (03PS1) 10Andrew Bogott: nova-compute: try to fix an incorrect check_procs alert [puppet] - 10https://gerrit.wikimedia.org/r/367692 (https://phabricator.wikimedia.org/T171606) [16:24:39] (03PS1) 10Giuseppe Lavagetto: prometheus::node_exporter: fix compatibility with the future parser [puppet] - 10https://gerrit.wikimedia.org/r/367694 [16:25:56] (03CR) 10Andrew Bogott: [C: 032] nova-compute: try to fix an incorrect check_procs alert [puppet] - 10https://gerrit.wikimedia.org/r/367692 (https://phabricator.wikimedia.org/T171606) (owner: 10Andrew Bogott) [16:26:58] I'm pretty sure it sends them to jobqueue [16:27:25] RECOVERY - Host ms-be2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.56 ms [16:28:16] (03PS2) 10Giuseppe Lavagetto: prometheus::node_exporter: fix compatibility with the future parser [puppet] - 10https://gerrit.wikimedia.org/r/367694 [16:29:16] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler02/7160/conf1001.eqiad.wmnet/ NOOP with the current parser, works equally with the future one." 
[puppet] - 10https://gerrit.wikimedia.org/r/367694 (owner: 10Giuseppe Lavagetto) [16:31:32] (03PS3) 10Madhuvishy: Remove specific version annotation for nginx [puppet] - 10https://gerrit.wikimedia.org/r/365650 (owner: 10Muehlenhoff) [16:32:04] (03CR) 10Madhuvishy: "+1, can be merged after the notebook1001 nginx upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/365650 (owner: 10Muehlenhoff) [16:33:55] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2029376 [16:34:14] (03CR) 10Muehlenhoff: [C: 032] Remove specific version annotation for nginx [puppet] - 10https://gerrit.wikimedia.org/r/365650 (owner: 10Muehlenhoff) [16:34:50] 10Operations, 10LDAP-Access-Requests, 10Wikidata-Sprint: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3470755 (10christophneuroth) @MoritzMuehlenhoff sure, it's christoph.neuroth_ext@wikimedia.de [16:35:55] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Package[nginx-common] [16:36:12] ^ on it [16:37:57] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:39:09] 10Operations: Upgrade nginx on notebook* servers - https://phabricator.wikimedia.org/T156495#3470763 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff The notebook* servers are now running 1.11.10-1+wmf3 [16:41:06] 10Operations, 10vm-requests, 10Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3470770 (10jcrespo) 05Open>03Resolved [16:42:26] 10Operations, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3470774 (10jcrespo) [16:42:29] 10Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (10jcrespo) [16:44:25] PROBLEM - Host ms-be2024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:45:05] 10Operations, 10Availability (Multiple-active-datacenters), 10DC-Switchover-Prep-Q3-2016-17, 10Epic: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3470778 (10jcrespo) 05Open>03Resolved a:03Joe I would close this as resolved, the only unchecked par... [16:49:35] RECOVERY - Host ms-be2024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.58 ms [16:50:06] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:51:40] 10Operations, 10Android-app-feature-Compilations, 10Traffic, 10Wikipedia-Android-App-Backlog, 10Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3470841 (10Fjalapeno) [16:55:06] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [16:56:12] 10Operations, 10ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3470879 (10RobH) Ok, worked with Jeff, I had some mistakes in my config and fqdn. physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 600 GB): OK physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 600 GB): OK physical... [16:58:11] Is it still puppet swat? :P [16:58:16] <_joe_> nope [16:58:21] <_joe_> I'm in a meeting in 2 minutes [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T1700). [17:00:15] ORES is up for a deploy [17:00:18] no parsoid deploy today [17:00:29] I'll get started on it. 
[17:00:56] (03PS2) 10Jcrespo: mariadb: Add grants for rddmark to m1 [puppet] - 10https://gerrit.wikimedia.org/r/365035 (https://phabricator.wikimedia.org/T170158) [17:02:11] (03PS1) 10Muehlenhoff: Add chrisneuroth to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/367699 [17:03:07] (03PS3) 10Jcrespo: mariadb: Add grants for rddmarc to m1 [puppet] - 10https://gerrit.wikimedia.org/r/365035 (https://phabricator.wikimedia.org/T170158) [17:05:55] !log halfak@tin Started deploy [ores/deploy@835d848]: T171505 [17:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:04] Here we go [17:06:04] T171505: Late-July 2017 ORES deploy - https://phabricator.wikimedia.org/T171505 [17:06:22] awight, Amir1: ^ [17:08:04] (03CR) 10Muehlenhoff: [C: 032] Add chrisneuroth to LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/367699 (owner: 10Muehlenhoff) [17:09:13] woot! [17:10:07] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.9 [keeping static files] (duration: 01m 39s) [17:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:28] 10Operations, 10LDAP-Access-Requests, 10Wikidata-Sprint: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3471028 (10MoritzMuehlenhoff) 05Open>03Resolved @christophneuroth : I've added you to the "nda" group now. You should now be able to log into the services listed here... [17:10:32] canary check time [17:10:34] awight, ^ [17:10:37] want to do that [17:10:44] Specifically wikidatawiki and fawiki [17:12:38] OK. Looks like I'm doing it :) [17:12:53] 10Operations, 10LDAP-Access-Requests, 10Wikidata-Sprint: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3471038 (10MoritzMuehlenhoff) Sorry, I meant "wmde" group, not "nda". [17:13:42] harr [17:13:45] Canary looks good. 
[17:13:46] Moving on [17:13:49] trigger-happy [17:13:55] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 28963 [17:14:22] Oh wait. I think we have an increase in timeout errors in eqiad. [17:14:27] https://grafana.wikimedia.org/dashboard/db/ores?panelId=11&fullscreen&orgId=1 [17:15:05] halfak: Is that really 1.25 per minute? [17:15:28] Anyway, it started 1.5h ago so it’s not the deployment eh? [17:15:38] right [17:16:49] (03CR) 10Bearloga: Move R-related code from shiny_server to separate module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/366170 (https://phabricator.wikimedia.org/T153856) (owner: 10Bearloga) [17:17:04] OK moving on! [17:17:28] (03PS2) 10Bearloga: Move R-related code from shiny_server to separate module [puppet] - 10https://gerrit.wikimedia.org/r/366170 (https://phabricator.wikimedia.org/T153856) [17:18:59] Fetch stage 11% [17:19:31] That timeout graph is strange. I can’t imagine why we would have had such a steady rate in the first place. Maybe I don’t understand the cause of timeouts. [17:19:45] halfak: Speaking of fetching speeds, I filed T171619 for ORES a bit ago :) [17:19:45] T171619: ORES should use git-fat for wheel deployments - https://phabricator.wikimedia.org/T171619 [17:20:21] RainbowSprinkles: hey, awesome. We’ve been discussing git-lfs for the 10MB binary models, too. [17:20:33] Is git-fat ready for production use? [17:20:37] We don't support git-lfs in scap (yet? we could possibly) [17:20:45] And yes, git-fat is production-ready [17:20:50] We use it for a number of projects [17:20:55] (mostly jar files, right now) [17:21:05] halfak: Amir1: ^ holler [17:21:10] RainbowSprinkles, it needs to be available publicly for our labs deploys. 
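The canary check run here (and scap's later abort in this log, "average error rate on 1/11 canaries increased by 10x") boils down to deploying to a small subset of hosts first and comparing their error rate against the pre-deploy baseline. A minimal sketch of that gate; the function name and signature are assumptions, not scap's actual API, though the 10x factor matches the scap message in this log:

```python
# Canary gate sketch: abort the full sync if error rate on the canary
# hosts jumped by max_increase (10x here, as in scap's log message)
# after the canary deploy. Names are illustrative, not scap's real code.

def canary_ok(baseline_rate: float, post_deploy_rate: float,
              max_increase: float = 10.0) -> bool:
    """Pass if the canaries' error rate did not rise by max_increase or more."""
    if baseline_rate == 0:
        # A clean baseline means any new errors should fail the check.
        return post_deploy_rate == 0
    return post_deploy_rate / baseline_rate < max_increase

assert canary_ok(0.5, 2.0)        # 4x increase: within tolerance
assert not canary_ok(0.5, 5.0)    # 10x increase: abort, as happened in this log
```

In scap's case the rates come from the logstash error dashboard linked in the failure message, and the operator can rerun with --force to override.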
[17:21:22] Also, git-lfs is just so much better :) [17:21:29] lol holy war [17:21:32] right [17:21:37] but public is the blocker [17:21:41] (03PS1) 10Jcrespo: rddmarc: Add fake dbpassword for the database application [labs/private] - 10https://gerrit.wikimedia.org/r/367702 (https://phabricator.wikimedia.org/T170158) [17:21:45] lfs is just, like, my opinion, man [17:21:52] Gerrit doesn't have git-lfs yet -- we need to upgrade + add a plugin [17:21:57] :) [17:21:59] Sooooo, tbd [17:22:08] RainbowSprinkles, should I file a task for that? [17:22:15] maybe we should see which of the two has better vim and/or emacs plugins and then bring in the meta-holy-war [17:22:27] halfak: Nope, it's already in my mental register of 190219212891289 tasks I need to deal with :p [17:22:28] bblack, lool [17:22:32] (03CR) 10Jcrespo: [V: 032 C: 032] rddmarc: Add fake dbpassword for the database application [labs/private] - 10https://gerrit.wikimedia.org/r/367702 (https://phabricator.wikimedia.org/T170158) (owner: 10Jcrespo) [17:22:38] (how I haven't hit a buffer overflow yet, idk) [17:22:53] :P But how will I track progress in your brain register ;) [17:23:22] Hmm, I think there's a MW extension for this. [17:23:23] I don’t see anything inherently private about git-fat, fwiw. halfak: I guess the blocker you’re referring to is due to WMF configuration? [17:23:26] Probably is [17:23:31] :) [17:23:43] awight: git-fat uses rsync, so needs a place to hit that [17:23:50] But right, not *inherently* private [17:24:30] (03CR) 10Jcrespo: "Keith- just waiting for your ok (+1) to create the db, the user and the backups." [puppet] - 10https://gerrit.wikimedia.org/r/365035 (https://phabricator.wikimedia.org/T170158) (owner: 10Jcrespo) [17:24:35] let's not block migrating away from binaries in git on git-lfs, git-fat is the standard in wmf production [17:25:24] Oh yeah, same page there :) [17:25:27] greg-g: o/! Seems like we’ll block either way, though? 
Maybe we should write the task like “need either one” [17:25:38] The performance problem with the ORES wheels will be a problem before we get lfs support :) [17:25:55] I'd rather not let the perfect be the enemy of the good here [17:25:57] (03CR) 10Bearloga: "> That being said, I'm of 2 minds about this. On one hand, the refactoring done here makes a lot of sense and providing a more generic "r"" [puppet] - 10https://gerrit.wikimedia.org/r/366170 (https://phabricator.wikimedia.org/T153856) (owner: 10Bearloga) [17:26:09] greg-g, we can't rely on git-fat for our labs deploys -- last I heard. [17:26:14] Wheels aren't a problem [17:26:16] yeah the models problem is even worse: our editquality repo is creeping on up to 2GB [17:26:17] models are a problem [17:26:38] I can't remember who I last talked to about git-fat [17:26:42] Ah, I was pointed at the wheels [17:26:47] But I was told that it wouldn't work for us. [17:26:48] But yeah, models will be an issue too I guess [17:26:52] wheels are tiny ^_^ [17:26:57] models are big [17:27:05] And models change all the time [17:27:09] wheels are mostly stable [17:27:10] https://github.com/wiki-ai/editquality/tree/master/models [17:27:38] (03CR) 10Gehel: "If we already have a real use case for this refactoring and we add a disclaimer, then yes, I think it is reasonable." [puppet] - 10https://gerrit.wikimedia.org/r/366170 (https://phabricator.wikimedia.org/T153856) (owner: 10Bearloga) [17:28:44] (03CR) 10Herron: [C: 031] "LGTM, thanks Jaime!" 
[puppet] - 10https://gerrit.wikimedia.org/r/365035 (https://phabricator.wikimedia.org/T170158) (owner: 10Jcrespo) [17:29:00] (03PS4) 10Jcrespo: mariadb: Add grants for rddmarc to m1 [puppet] - 10https://gerrit.wikimedia.org/r/365035 (https://phabricator.wikimedia.org/T170158) [17:29:03] We’re working on this issue under T170967 — I just CC’d releng [17:29:03] T170967: Split editquality repo to two repos, one with full history, one shallow - https://phabricator.wikimedia.org/T170967 [17:29:51] 10Operations, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3471092 (10madhuvishy) [17:30:47] 88% fetch [17:30:59] (03CR) 10Jcrespo: [C: 032] mariadb: Add grants for rddmarc to m1 [puppet] - 10https://gerrit.wikimedia.org/r/365035 (https://phabricator.wikimedia.org/T170158) (owner: 10Jcrespo) [17:33:36] promoting and restarting [17:33:41] !log creating new database on m1 (rddmarc) T170158 [17:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:51] T170158: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158 [17:34:23] (03PS1) 10Chad: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367704 [17:34:31] 10Operations, 10DBA, 10Mail, 10Patch-For-Review: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3471138 (10jcrespo) [17:35:04] 10Operations, 10DBA, 10Mail, 10Patch-For-Review: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3420991 (10jcrespo) You showed me a link for the database charset recommend, which one was it @herron ? 
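The git-fat vs. git-lfs discussion above rests on a mechanism both tools share: git versions a tiny pointer stub instead of the large binary, and a filter swaps the stub for the real blob (fetched over rsync for git-fat, HTTP for git-lfs). A minimal sketch of that clean/smudge pointer idea; the stub format and the in-memory store here are illustrative, not either tool's actual layout:

```python
# Pointer-file sketch: commit a small hash+size stub, keep the real blob
# in external storage. Stub format is an assumption for illustration.
import hashlib

def make_pointer(blob: bytes) -> str:
    """Replace a large blob with a tiny, diff-friendly stub."""
    return f"#$# external-blob {hashlib.sha1(blob).hexdigest()} {len(blob)}\n"

store = {}  # stand-in for the rsync/HTTP blob store

def clean(blob: bytes) -> str:
    """Commit-time filter: stash the blob externally, keep only the stub."""
    ptr = make_pointer(blob)
    store[ptr] = blob
    return ptr

def smudge(ptr: str) -> bytes:
    """Checkout-time filter: restore the real content from the store."""
    return store[ptr]

model = b"\x00" * 10_000_000            # e.g. a 10MB ORES model file
assert smudge(clean(model)) == model
assert len(clean(model)) < 100          # git only ever versions the stub
```

This also explains halfak's blocker above: the external store must be publicly reachable for labs deploys to smudge the stubs back into real files.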
[17:35:10] herron: ^ [17:35:11] (03CR) 10Chad: [C: 04-2] "Later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367704 (owner: 10Chad) [17:36:14] jynus utf8 please :) [17:36:25] ok, here is the problem [17:36:32] utf8 is 3 bytes [17:36:36] on mysql [17:36:42] utf8mb4 is utf8 [17:36:44] which is confusing [17:36:54] I will create it as utf8mb4 [17:37:01] but tell me if the installation fails [17:37:07] ok sounds good, thanks! [17:37:21] !log demon@tin Started scap: bootstrap wmf.11 [17:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:34] 10Operations, 10DBA, 10Mail, 10Patch-For-Review: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3471178 (10jcrespo) ``` $ mysql -h m1-master.eqiad.wmnet --skip-ssl -e "SHOW CREATE DATABASE rddmarc" +----------+-------------------------------------------------------------------... [17:38:42] 10Operations, 10DBA, 10Mail, 10Patch-For-Review: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3471181 (10jcrespo) [17:40:04] gwicke: was it you who had an "export etherpad to mediawiki syntax" tool? [17:40:52] !log halfak@tin Finished deploy [ores/deploy@835d848]: T171505 (duration: 34m 56s) [17:40:58] Success! [17:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:02] Thanks all [17:41:02] T171505: Late-July 2017 ORES deploy - https://phabricator.wikimedia.org/T171505 [17:41:07] Amir1, awight ^ FYI [17:41:21] o/5 [17:41:42] greg-g: AFAIK it’s to export as HTML and paste into VE [17:42:01] awight: close enough.... [17:42:18] If I know about it, this is almost certainly not the expert way though :-) [17:42:32] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3471226 (10Cmjohnson) The bios update that I have has failed to install....looking at another solution. 
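jynus's charset point above is the standard MySQL gotcha: the legacy "utf8" charset stores at most 3 bytes per character, so it rejects 4-byte sequences (emoji, supplementary-plane CJK), while "utf8mb4" is full UTF-8. A small Python illustration of the distinction; the helper name is hypothetical:

```python
# MySQL's legacy "utf8" is capped at 3 bytes per character; "utf8mb4"
# is real (up to 4-byte) UTF-8, hence the rddmarc database above being
# created as utf8mb4. Illustration only, not part of any deployed tool.

def fits_mysql_legacy_utf8(text: str) -> bool:
    """True if every character encodes to at most 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)

assert fits_mysql_legacy_utf8("café")      # 2-byte sequence: fine in "utf8"
assert not fits_mysql_legacy_utf8("🙂")    # 4-byte emoji: needs utf8mb4
```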
[17:42:34] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Mobile, 10Reading-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3471227 (10Jdlrobson) [17:46:32] 10Operations, 10Puppet, 10Wikimedia-Apache-configuration, 10Mobile, 10Reading-Web-Backlog (Tracking): On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist - https://phabricator.wikimedia.org/T154026#3471281 (10Jdlrobson) [17:47:44] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3471295 (10RobH) [17:47:56] 10Operations, 10Wikimedia-General-or-Unknown, 10Wikimedia-Site-requests, 10Mobile: Wikimania 2017 site does not automatically redirect to mobile site, when opening from a mobile device - https://phabricator.wikimedia.org/T120943#1865328 (10Jdlrobson) [17:48:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3471329 (10RobH) [17:49:02] (03PS1) 10Jcrespo: rddmarc: enable database connection only from the m1 dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/367708 (https://phabricator.wikimedia.org/T170158) [17:49:14] !log Rolling restart of codfw Cassandra instances (applying OpenJDK update) [17:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:36] 10Operations, 10Continuous-Integration-Infrastructure, 10Discovery, 10Discovery-Analysis, 10Release-Engineering-Team (Watching / External): Setup a mirror for R language dependencies (CRAN) - https://phabricator.wikimedia.org/T170995#3471367 (10mpopov) From @Ottomata at https://gerrit.wikimedia.org/r/#/c... 
[17:51:43] (03PS2) 10Jcrespo: rddmarc: enable database connection only from the m1 dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/367708 (https://phabricator.wikimedia.org/T170158) [17:52:17] (03CR) 10Jcrespo: [V: 032 C: 032] rddmarc: enable database connection only from the m1 dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/367708 (https://phabricator.wikimedia.org/T170158) (owner: 10Jcrespo) [17:53:53] !log demon@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) [17:53:53] !log demon@tin scap failed: RuntimeError scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/2cc7028226a539553178454fc2f14459 for details) (duration: 16m 32s) [17:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:32] !log demon@tin Started scap: bootstrap wmf.11 (x2) [17:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:25] PROBLEM - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.163 and port 9042: Connection refused [17:56:43] 10Operations, 10DBA, 10Mail, 10Patch-For-Review: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3471400 (10jcrespo) [17:57:25] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.163 port 9042 [17:57:37] ^^ that's part of the restart [17:58:08] i think a few will slip through, the restart times are really close to what icinga picks up [17:58:11] right on the edge [17:58:31] 10Operations, 10Mail: set up DMARC aggregate report collection into a database for research and reporting - 
https://phabricator.wikimedia.org/T86209#3471404 (10jcrespo) [17:58:34] 10Operations, 10DBA, 10Mail, 10Patch-For-Review: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3420991 (10jcrespo) 05Open>03Resolved ``` root@diadem:~$ mysql -h m1-master.eqiad.wmnet rddmarc -u rddmarc -p Enter password: Welcome to the MySQL monitor. Commands end with ;... [18:01:06] PROBLEM - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.164 and port 9042: Connection refused [18:02:06] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042 [18:02:48] 10Operations, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3471455 (10jcrespo) [18:03:15] RECOVERY - Host ms-be2024 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [18:04:55] PROBLEM - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.176 and port 9042: Connection refused [18:05:35] PROBLEM - SSH on ms-be2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:35] PROBLEM - Disk space on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:35] PROBLEM - dhclient process on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:35] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:35] PROBLEM - DPKG on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:36] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:36] PROBLEM - salt-minion processes on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:05:37] PROBLEM - configured eth on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:37] PROBLEM - swift-object-auditor on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:38] PROBLEM - Check size of conntrack table on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:38] PROBLEM - swift-object-updater on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:39] PROBLEM - swift-account-auditor on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:39] PROBLEM - swift-account-reaper on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:40] PROBLEM - swift-account-server on ms-be2024 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:05:56] RECOVERY - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.176 port 9042 [18:06:16] (03PS1) 10EBernhardson: Decrease elasticsearch search threadp pool to 32 for cirrus servers [puppet] - 10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) [18:08:44] (03CR) 10EBernhardson: "puppet compiler output looks sane: http://puppet-compiler.wmflabs.org/7161/elastic1040.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [18:09:04] (03PS2) 10EBernhardson: Decrease elasticsearch search thread pool to 32 for cirrus servers [puppet] - 10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) [18:09:35] PROBLEM - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.177 and port 9042: Connection refused [18:09:36] "Virtual Serial Port is currently in use by another session." 
[18:09:42] 10Operations, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3471520 (10Krinkle) [18:09:55] PROBLEM - Host ms-be2024 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:56] anyone else checking ms-be2024 ? [18:10:36] RECOVERY - cassandra-b CQL 10.192.16.177:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.177 port 9042 [18:10:53] oh, this may be known [18:12:20] 10Operations, 10ops-codfw: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3459460 (10jcrespo) This went down right now- I am not going to touch nothing because it may be under maintenance. [18:12:23] 10Operations, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3471601 (10Krinkle) [18:13:46] PROBLEM - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.178 and port 9042: Connection refused [18:13:49] 10Operations, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#2990470 (10Krinkle) [18:14:46] RECOVERY - cassandra-c CQL 10.192.16.178:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.178 port 9042 [18:14:56] !log demon@tin Finished scap: bootstrap wmf.11 (x2) (duration: 19m 23s) [18:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:49] 10Operations, 10ops-codfw: ms-be2024 not powering on - https://phabricator.wikimedia.org/T171275#3471642 (10Papaul) a:05Papaul>03fgiunchedi Step 1 - Drain power Step 2 - Upgrade ILO firmware from 2.4 to 2.54 Test to power /on-off the server, it is working now. 
@fgiunchedi when the server came up, it was... [18:15:57] 10Operations, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3471645 (10jcrespo) [18:16:44] 10Operations, 10Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#2990470 (10jcrespo) [18:23:36] (03PS1) 10Rush: diamond: set diskusage filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) [18:24:21] (03PS2) 10Rush: diamond: set diskusage filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) [18:26:19] (03CR) 10jerkins-bot: [V: 04-1] diamond: set diskusage filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) (owner: 10Rush) [18:27:39] (03PS3) 10Rush: diamond: set diskspace filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) [18:27:53] (03PS4) 10Rush: diamond: set diskspace filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) [18:28:32] !log restbase upgrading node to v6.11 - T170548 [18:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:44] T170548: nodejs 6.11 - https://phabricator.wikimedia.org/T170548 [18:29:55] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.134 and port 9042: Connection refused [18:30:27] (03CR) 10jerkins-bot: [V: 04-1] diamond: set diskspace filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) (owner: 10Rush) [18:30:55] RECOVERY - cassandra-a CQL 
10.192.32.134:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.134 port 9042 [18:33:05] PROBLEM - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is CRITICAL: connect to address 10.192.32.135 and port 9042: Connection refused [18:34:05] RECOVERY - cassandra-b CQL 10.192.32.135:9042 on restbase2003 is OK: TCP OK - 0.036 second response time on 10.192.32.135 port 9042 [18:34:41] (03PS2) 10Krinkle: Enable jQuery 3 on test.wikipedia.org and test2.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366994 (https://phabricator.wikimedia.org/T124742) [18:35:26] (03CR) 10Krinkle: [C: 032] Enable jQuery 3 on test.wikipedia.org and test2.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366994 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [18:35:44] (03PS4) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [18:36:06] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) (owner: 10Paladox) [18:36:58] (03Merged) 10jenkins-bot: Enable jQuery 3 on test.wikipedia.org and test2.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366994 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [18:37:08] (03CR) 10jenkins-bot: Enable jQuery 3 on test.wikipedia.org and test2.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/366994 (https://phabricator.wikimedia.org/T124742) (owner: 10Krinkle) [18:37:41] (03PS5) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [18:38:03] (03CR) 10Paladox: Phabricator: Redirect all http traffic to https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) 
(owner: 10Paladox) [18:38:34] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) (owner: 10Paladox) [18:39:10] (03PS1) 10Gehel: maps - switch to using the default 3857 projection [puppet] - 10https://gerrit.wikimedia.org/r/367713 (https://phabricator.wikimedia.org/T169011) [18:39:12] (03PS6) 10Paladox: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) [18:41:19] !log mobrovac@tin Started deploy [restbase/deploy@36ca85f]: Switch to Node v6.11 - T170548 [18:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:30] T170548: nodejs 6.11 - https://phabricator.wikimedia.org/T170548 [18:41:56] (03CR) 10Pnorman: [C: 031] maps - switch to using the default 3857 projection [puppet] - 10https://gerrit.wikimedia.org/r/367713 (https://phabricator.wikimedia.org/T169011) (owner: 10Gehel) [18:42:02] (03CR) 10Gehel: [C: 032] maps - switch to using the default 3857 projection [puppet] - 10https://gerrit.wikimedia.org/r/367713 (https://phabricator.wikimedia.org/T169011) (owner: 10Gehel) [18:42:17] !log krinkle@tin Synchronized wmf-config/InitialiseSettings.php: Enable jQuery 3 on testwikis - I37a68472cf (duration: 00m 50s) [18:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:21] 10Operations, 10monitoring: On stretch, python metric collector for disk is on DEBUG logging mode - https://phabricator.wikimedia.org/T171638#3471757 (10jcrespo) [18:45:35] PROBLEM - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.144 and port 9042: Connection refused [18:46:35] RECOVERY - cassandra-b CQL 10.192.32.144:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.144 port 9042 [18:46:36] !log mobrovac@tin Finished deploy [restbase/deploy@36ca85f]: 
Switch to Node v6.11 - T170548 (duration: 05m 17s) [18:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:45] T170548: nodejs 6.11 - https://phabricator.wikimedia.org/T170548 [18:47:42] !log mobrovac@tin Started deploy [restbase/deploy@36ca85f]: (no justification provided) [18:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:31] !log mobrovac@tin Finished deploy [restbase/deploy@36ca85f]: (no justification provided) (duration: 00m 49s) [18:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:40] !log mobrovac@tin Started deploy [restbase/deploy@36ca85f]: (no justification provided) [18:48:46] PROBLEM - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is CRITICAL: connect to address 10.192.32.145 and port 9042: Connection refused [18:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:46] RECOVERY - cassandra-c CQL 10.192.32.145:9042 on restbase2008 is OK: TCP OK - 0.036 second response time on 10.192.32.145 port 9042 [18:52:29] !log mobrovac@tin Finished deploy [restbase/deploy@36ca85f]: (no justification provided) (duration: 03m 49s) [18:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T1900). [19:00:43] greg-g: You know what I just realized, the new branch deploys Tuesdays at noon SF time. [19:00:57] The same time as the weekly city sirens test. [19:01:01] It's....oddly appropriate [19:01:05] FALLOUT. 
TAKE COVER [19:03:40] 10Operations, 10ops-eqiad: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T171232#3471829 (10Cmjohnson) Disk replaced and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firm... [19:05:35] PROBLEM - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.46 and port 9042: Connection refused [19:06:36] RECOVERY - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.46 port 9042 [19:06:41] 10Operations, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on labsdb1001 - https://phabricator.wikimedia.org/T171538#3471836 (10Cmjohnson) Disk replaced and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Onlin... [19:06:59] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3398609 (10Cmjohnson) Disk replaced and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Rebuild Firmware state: Online, Spun Up Firmware state: Online, Spun Up Fi... [19:09:28] (03PS1) 10Andrew Bogott: nova fullstack: Support specifying a labvirt host to test [puppet] - 10https://gerrit.wikimedia.org/r/367718 [19:11:24] 10Operations, 10ops-eqiad, 10Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3471853 (10Cmjohnson) @ema can you verify the host name for me please. cp1008 was decom'd a long time ago. 
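The Phabricator updates above paste raw megacli physical-drive output to show rebuild progress after a disk swap. A hedged sketch of summarizing those "Firmware state:" lines; the parser and the result shape are assumptions, not any deployed check:

```python
# Summarize megacli-style output by counting drive firmware states,
# flagging rebuilding members as in the RAID tickets above.

def raid_health(output: str) -> dict:
    """Count Online vs Rebuild drives in megacli physical-drive output."""
    states = [line.split(":", 1)[1].strip()
              for line in output.splitlines()
              if line.startswith("Firmware state:")]
    return {
        "total": len(states),
        "rebuilding": sum(s.startswith("Rebuild") for s in states),
        "online": sum(s.startswith("Online") for s in states),
    }

sample = """Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up"""
assert raid_health(sample) == {"total": 3, "rebuilding": 1, "online": 2}
```

Once the rebuild finishes, every drive reports "Online, Spun Up" again and the MegaRAID icinga check recovers, as it does for db1001 later in this log.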
[19:12:28] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3471871 (10Cmjohnson) [19:12:36] 10Operations, 10ops-eqiad, 10Traffic: Degraded RAID on cp1008 - https://phabricator.wikimedia.org/T171028#3451439 (10BBlack) It was decommed a long time ago, and then I revived it as a quasi-production testing machine for "temporary" use for a little while, and probably poorly documented that, and now "tempo... [19:12:45] PROBLEM - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused [19:13:45] RECOVERY - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is OK: TCP OK - 0.036 second response time on 10.192.48.48 port 9042 [19:13:47] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3352141 (10Cmjohnson) @chasemp Which vlan are these going in...I racked in row A and D....i see the instruction say it's public but I see a comment that it's labs-support.... 
[19:15:21] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install druid100[456].eqiad.wmnet - https://phabricator.wikimedia.org/T171626#3471295 (10Cmjohnson) [19:15:45] PROBLEM - cassandra-a CQL 10.192.48.49:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.49 and port 9042: Connection refused [19:16:39] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3471893 (10Cmjohnson) [19:16:45] RECOVERY - cassandra-a CQL 10.192.48.49:9042 on restbase2006 is OK: TCP OK - 0.036 second response time on 10.192.48.49 port 9042 [19:18:31] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3471907 (10Cmjohnson) [19:18:55] PROBLEM - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.50 and port 9042: Connection refused [19:19:06] (03CR) 10Andrew Bogott: [C: 032] nova fullstack: Support specifying a labvirt host to test [puppet] - 10https://gerrit.wikimedia.org/r/367718 (owner: 10Andrew Bogott) [19:19:52] (03CR) 10Krinkle: [C: 031] diamond: set diskspace filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) (owner: 10Rush) [19:19:55] RECOVERY - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is OK: TCP OK - 0.036 second response time on 10.192.48.50 port 9042 [19:19:56] 10Operations, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on labsdb1001 - https://phabricator.wikimedia.org/T171538#3471914 (10chasemp) thanks you @Cmjohnson [19:20:50] (03PS6) 10Rush: diamond: set diskspace filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) [19:21:10] (03CR) 10Chad: [C: 032] group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367704 (owner: 10Chad) 
[19:22:14] (03PS7) 10Rush: diamond: set diskspace filesystems explicitly [puppet] - 10https://gerrit.wikimedia.org/r/367710 (https://phabricator.wikimedia.org/T171583) [19:22:35] (03Merged) 10jenkins-bot: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367704 (owner: 10Chad) [19:22:45] PROBLEM - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.51 and port 9042: Connection refused [19:22:46] (03CR) 10jenkins-bot: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367704 (owner: 10Chad) [19:24:45] RECOVERY - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is OK: TCP OK - 0.036 second response time on 10.192.48.51 port 9042 [19:28:44] (03PS1) 10Andrew Bogott: nova fullstack: Run on labvirt1016 for a while [puppet] - 10https://gerrit.wikimedia.org/r/367726 (https://phabricator.wikimedia.org/T171641) [19:29:37] (03PS3) 10Bearloga: Move R-related code from shiny_server to separate module [puppet] - 10https://gerrit.wikimedia.org/r/366170 (https://phabricator.wikimedia.org/T153856) [19:29:46] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:29:53] (03CR) 10Andrew Bogott: [C: 032] nova fullstack: Run on labvirt1016 for a while [puppet] - 10https://gerrit.wikimedia.org/r/367726 (https://phabricator.wikimedia.org/T171641) (owner: 10Andrew Bogott) [19:31:55] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.11 [19:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:45] RECOVERY - MegaRAID on db1001 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [19:37:50] 10Operations, 10ops-eqiad: Degraded RAID on db1001 - https://phabricator.wikimedia.org/T171232#3472024 (10Marostegui) 05Open>03Resolved a:03Cmjohnson RAID back to Optimal Thanks Chris!! 
``` root@db1001:~# megacli -pdrbld -showprog -physdrv\[32:6\] -aALL Device(Encl-32 Slot-6) is not in rebuild process... [19:40:55] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:44:06] (03CR) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [19:45:49] (03PS23) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [19:48:14] (03CR) 10Krinkle: [C: 031] "Cherry-picked to deployment-prep. Manually verified at https://en.wikipedia.beta.wmflabs.org/--errorpage-noise. The latest change by @Ema " [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [19:51:05] PROBLEM - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.139 and port 9042: Connection refused [19:52:05] RECOVERY - cassandra-c CQL 10.192.32.139:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.139 port 9042 [19:52:50] (03PS2) 10Andrew Bogott: puppetmaster frontend profile: Allow hiera to configure the hostname [puppet] - 10https://gerrit.wikimedia.org/r/367621 [19:54:09] (03CR) 10Andrew Bogott: [C: 032] puppetmaster frontend profile: Allow hiera to configure the hostname [puppet] - 10https://gerrit.wikimedia.org/r/367621 (owner: 10Andrew Bogott) [19:58:25] RECOVERY - MegaRAID on labsdb1001 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [20:07:37] (03PS1) 10Madhuvishy: labstore block_sync: Use the logging library instead of print [puppet] - 10https://gerrit.wikimedia.org/r/367742 [20:18:14] (03PS1) 10Thcipriani: Scap: bump version to 3.6.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/367749 (https://phabricator.wikimedia.org/T127762) [20:32:22] 
(03PS1) 10Andrew Bogott: added some cnames for new labs puppetmaster names [dns] - 10https://gerrit.wikimedia.org/r/367776 [20:33:34] (03PS2) 10Andrew Bogott: added some cnames for new labs puppetmaster names [dns] - 10https://gerrit.wikimedia.org/r/367776 [20:34:51] (03PS3) 10Andrew Bogott: added some cnames for new labs puppetmaster names [dns] - 10https://gerrit.wikimedia.org/r/367776 [20:35:07] Reedy: hi, is it possible to see logstash for https://phabricator.wikimedia.org/T171612 ? [20:36:16] TabbyCat: I'm almost certain that's a dupe [20:36:38] https://phabricator.wikimedia.org/T171523 [20:36:41] Reedy: no idea, just saw it [20:36:47] and tagged it with "logspam" [20:37:21] (03CR) 10Andrew Bogott: [C: 032] added some cnames for new labs puppetmaster names [dns] - 10https://gerrit.wikimedia.org/r/367776 (owner: 10Andrew Bogott) [20:37:25] Considering both are checkuser [20:37:28] I'd be surprised if it wasn't [20:40:03] TabbyCat: Yeah, it is [20:40:04] Thanks [20:40:22] Reedy: thanks to MatmaRex I've found T171523 [20:40:22] T171523: Looking at deleted versions gives an internal error as checkuser on nlwikipedia - https://phabricator.wikimedia.org/T171523 [20:40:31] tagged UBN omg [20:40:42] I linked that above :P [20:40:53] lol I didn't read you [20:40:57] sorry [20:45:52] 10Operations, 10Patch-For-Review, 10Services (doing), 10User-mobrovac: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3472387 (10ksmith) @Gehel (and @debt): I see a couple maps items at the top of this list. Is this something you are or should be aware of? [20:48:08] 10Operations, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3472417 (10Ladsgroup) [20:49:47] Reedy: can you review the patch on that? 
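The `megacli -pdrbld -showprog` invocation quoted earlier for db1001 reports either a rebuild percentage or the "is not in rebuild process" message seen in the log once the RAID returned to Optimal. A minimal sketch of parsing that output follows; the function name is mine, and the `Completed NN%` wording is an assumption based on typical MegaCli progress output rather than anything shown in this log:

```python
import re

def rebuild_progress(megacli_output: str):
    """Parse `megacli -pdrbld -showprog -physdrv[E:S] -aALL` output.

    Returns the completed percentage as an int, or None when the drive
    reports 'not in rebuild process' (rebuild finished or never started,
    as in the db1001 excerpt above). The 'Completed NN%' pattern is an
    assumption about MegaCli's progress format, not taken from this log.
    """
    if "not in rebuild process" in megacli_output:
        return None
    m = re.search(r"Completed\s+(\d+)%", megacli_output)
    return int(m.group(1)) if m else None
```

In practice this sits in a polling loop until the call returns None, at which point the Icinga MegaRAID check flips back to "OK: optimal" as it did above.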
[20:59:50] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3472470 (10RobH) [21:11:12] 10Operations, 10Continuous-Integration-Infrastructure, 10Discovery, 10Discovery-Analysis, 10Release-Engineering-Team (Watching / External): Setup a mirror for R language dependencies (CRAN) - https://phabricator.wikimedia.org/T170995#3472542 (10hashar) @mpopov using twitter to get the size was a smart mo... [21:12:35] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - /a is not accessible: Input/output error [21:13:36] !log Rolling restart of eqiad Cassandra instances (applying OpenJDK update) [21:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:59] 10Operations, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3472561 (10Ladsgroup) [21:16:06] PROBLEM - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is CRITICAL: connect to address 10.64.0.230 and port 9042: Connection refused [21:17:06] RECOVERY - cassandra-a CQL 10.64.0.230:9042 on restbase1007 is OK: TCP OK - 0.001 second response time on 10.64.0.230 port 9042 [21:24:55] PROBLEM - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.114 and port 9042: Connection refused [21:25:55] RECOVERY - cassandra-a CQL 10.64.0.114:9042 on restbase1010 is OK: TCP OK - 0.002 second response time on 10.64.0.114 port 9042 [21:31:35] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 22300 [21:34:16] PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.117 and port 9042: Connection refused [21:35:16] RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.117 port 9042 [21:37:25] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is 
CRITICAL: connect to address 10.64.0.118 and port 9042: Connection refused [21:38:26] RECOVERY - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is OK: TCP OK - 0.022 second response time on 10.64.0.118 port 9042 [21:40:45] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [21:41:39] (03PS1) 10Ayounsi: Add pfw3-codfw loopback and uplinks IPs to DNS [dns] - 10https://gerrit.wikimedia.org/r/367809 (https://phabricator.wikimedia.org/T169643) [21:41:45] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042 [21:41:45] RECOVERY - Disk space on stat1002 is OK: DISK OK [21:43:48] jouncebot: next [21:43:48] In 1 hour(s) and 16 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T2300) [21:53:46] Reedy: Just deploy it? [21:53:55] I was vaguely planning on [21:54:02] Just waiting for jerkins to merge the master patch first [21:54:28] Tell thcipriani or whomsoever's got the conch this week first, maybe. ;-) [21:54:43] It's merged in master. [21:55:31] Topic in #wikimedia-releng says I'm a team member ;P [21:56:04] Ha. OK, OK. 
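The flapping `cassandra-* CQL …:9042` alerts above come from a plain TCP probe: CRITICAL while the restarting instance refuses connections, OK once the CQL port accepts one. A sketch of that probe, assuming nothing beyond standard sockets (this is not the actual Icinga check_tcp plugin):

```python
import socket

def cql_port_open(host: str, port: int = 9042, timeout: float = 2.0) -> bool:
    """Equivalent of the TCP probe behind the 'cassandra-a CQL ...' alerts:
    True (OK) when the CQL port accepts a connection, False (CRITICAL,
    'Connection refused') when it does not."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

During a rolling restart like the one logged at 21:13, each instance briefly fails this probe and recovers within a minute or two, which matches the PROBLEM/RECOVERY pairs above.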
[21:56:04] (03CR) 10Hashar: [C: 031] Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) (owner: 1020after4) [22:00:17] * thcipriani blinks [22:00:51] Reedy: you're welcome :) [22:01:02] heh [22:03:35] PROBLEM - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is CRITICAL: connect to address 10.64.32.203 and port 9042: Connection refused [22:04:35] RECOVERY - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is OK: TCP OK - 0.000 second response time on 10.64.32.203 port 9042 [22:04:37] 10Operations, 10Deployment-Systems, 10MediaWiki-JobRunner, 10Release-Engineering-Team (Next), 10Scap (Scap3-Adoption-Phase1): Figure out how to disable starting of jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3472824 (10Krinkle) [22:06:23] (03PS1) 10Thcipriani: Allow mwdeploy user to restart jobchron [puppet] - 10https://gerrit.wikimedia.org/r/367815 (https://phabricator.wikimedia.org/T129148) [22:10:06] PROBLEM - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.205 and port 9042: Connection refused [22:10:07] (03CR) 10Krinkle: Allow mwdeploy user to restart jobchron (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/367815 (https://phabricator.wikimedia.org/T129148) (owner: 10Thcipriani) [22:11:06] RECOVERY - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.205 port 9042 [22:12:54] !log reedy@tin Synchronized php-1.30.0-wmf.10/includes/specials/SpecialUndelete.php: T171523 (duration: 00m 47s) [22:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:06] T171523: Looking at deleted versions gives an internal error as checkuser - https://phabricator.wikimedia.org/T171523 [22:13:14] (03PS2) 10Thcipriani: Allow mwdeploy user to restart jobchron [puppet] - 10https://gerrit.wikimedia.org/r/367815 
(https://phabricator.wikimedia.org/T129148) [22:13:15] PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.206 and port 9042: Connection refused [22:14:15] RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.206 port 9042 [22:14:20] !log reedy@tin Synchronized php-1.30.0-wmf.11/includes/specials/SpecialUndelete.php: T171523 (duration: 00m 46s) [22:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:55] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [22:37:46] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [22:40:15] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [22:41:05] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [22:44:15] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [22:44:16] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:44:35] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:44:35] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: 
/{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:44:35] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:44:35] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:44:35] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:44:35] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:44:35] more phab spam: https://phabricator.wikimedia.org/p/Wikishopia/ [22:44:36] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200) [22:44:45] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:45:19] :( [22:45:25] I can't delete them [22:46:25] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:46:35] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [22:46:55] Reedy: need help? 
[22:47:12] volans: There's a few rars uploaded by the user dbrant linked above [22:47:25] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:47:36] There's some by https://phabricator.wikimedia.org/p/Marama1/ from earlier [22:47:42] * volans looking [22:48:15] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [22:48:15] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:48:49] Reedy: also from the user "mohamednalsa" [22:48:59] Destroying [22:49:00] I'll clean all of them (18 files in total) [22:49:09] tldr: sudo bin/remove destroy F123 [22:49:22] RainbowSprinkles: are you saying you're already doing it? [22:49:30] or should I continue :D [22:49:30] I don't have access to the right machine... do I? [22:49:36] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:49:36] I got the first 3 [22:49:48] Reedy: iridium w/ sudo [22:50:01] I usually get the PHID from the DB and then run them all at once [22:50:06] I guess I don't [22:50:35] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [22:51:09] volans: You can do like a batch option and put all them in one `remove` call [22:51:23] Er, there is a force or batch thing that at least avoids the confirmation [22:51:25] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:51:35] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was 
received [22:51:35] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:52:00] RainbowSprinkles: T168142#3400290 ;) [22:52:00] T168142: Cleanup phabricator.wikimedia.org uploaded files, WP zero abuse - https://phabricator.wikimedia.org/T168142 [22:52:25] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [22:52:44] Reedy: done [22:53:04] dbrant too ^^^ [22:53:09] taaa [22:53:12] thx!! [22:53:23] i think ema has a patch to disable uploads [22:53:38] he does [22:53:45] ah was merged today [22:53:45] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:54:18] do we have indication that those are Zero too? [22:54:25] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:54:40] Not without finding what IP they signed up on [22:54:42] we know already that they were uploading from non-zero, so it wouldn't surprise me if they are not [22:54:45] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) timed out before a response was received [22:55:45] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:57:25] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [22:57:26] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [22:57:26] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:57:35] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [22:57:36] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are 
healthy [22:57:36] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [22:57:36] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [22:57:36] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [22:57:36] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:57:45] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [22:57:45] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170725T2300). Please do the needful. [23:00:04] TabbyCat: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
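The Phabricator file cleanup discussed above boils down to `sudo bin/remove destroy F123` on the phab host, with volans batching all IDs into a single call ("run them all at once"). A small helper that builds that batched command line; the argument layout mirrors what the log shows, and since the flag that skips the interactive confirmation is not named in the log, it is deliberately left out here:

```python
def destroy_command(file_ids):
    """Build one batched `bin/remove destroy` invocation, per the cleanup
    above (run with sudo on the Phabricator host). Accepts numeric file
    IDs and emits the F-prefixed monograms Phabricator expects."""
    if not file_ids:
        raise ValueError("no file IDs to destroy")
    return ["sudo", "bin/remove", "destroy"] + ["F%d" % i for i in file_ids]
```

Passing the result to `subprocess.run` on the right host would then prompt once for the whole batch instead of once per file.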
[23:00:14] o/ [23:00:46] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:01:25] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:25] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:26] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:35] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:35] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:36] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:45] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:47] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:47] PROBLEM - restbase endpoints health on restbase2009 
is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:47] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:47] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:47] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:55] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:01:55] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:55] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:55] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:01:55] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve 
aggregated feed content for April 29, 2016) timed out before a response was received [23:01:55] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:05] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:05] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:05] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:05] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:05] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:13] (03PS4) 10Reedy: Allow contentadmin/sysop to configure blocking AbuseFilters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369 (owner: 10MarcoAurelio) [23:02:13] moritzm: ^ [23:02:15] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:15] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated 
feed content for April 29, 2016) timed out before a response was received [23:02:15] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:02:16] (03CR) 10Reedy: [C: 032] Allow contentadmin/sysop to configure blocking AbuseFilters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369 (owner: 10MarcoAurelio) [23:03:12] TabbyCat: I hope he's alseep [23:03:16] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:03:35] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [23:03:50] Reedy: he's listed in the topic, I don't know further :) [23:03:55] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [23:04:25] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [23:05:05] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:05:18] ^^^ looking at this [23:06:21] (03Merged) 10jenkins-bot: Allow contentadmin/sysop to configure blocking AbuseFilters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369 (owner: 10MarcoAurelio) [23:06:31] (03CR) 10jenkins-bot: Allow contentadmin/sysop to configure blocking AbuseFilters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367369 (owner: 10MarcoAurelio) [23:06:45] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:07:05] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve 
aggregated feed content for April 29, 2016) timed out before a response was received [23:07:05] 10Operations, 10ops-codfw, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Rename mw2148 / mw2149 / mw2259 / mw2260 to thumbor200[1234] - https://phabricator.wikimedia.org/T168881#3473033 (10faidon) a:03Papaul @Papaul, this needs to be fixed in the server labels and Racktables. [23:07:35] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:09:45] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [23:09:50] urandom: thoughts on restbase alerts? [23:10:08] greg-g: i think it's a problem with the mobile content service [23:10:35] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [23:10:57] greg-g: still looking [23:12:00] Reedy: can you sync. that change? since wikitech is in silver it cannot be tested on mwdebug [23:12:12] Sorry, got distracted [23:12:49] np [23:12:57] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [23:12:57] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:13:47] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [23:13:56] !log reedy@tin Synchronized wmf-config/abusefilter.php: Allow contentadmin/sysop to configure blocking AbuseFilters (duration: 00m 46s) [23:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:47] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [23:14:48] RECOVERY - 
restbase endpoints health on restbase1016 is OK: All endpoints are healthy [23:14:48] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [23:14:48] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [23:14:48] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [23:14:48] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [23:14:57] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [23:14:57] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [23:14:58] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [23:14:58] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [23:14:58] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [23:15:07] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [23:15:07] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [23:15:08] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [23:15:08] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [23:15:27] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [23:15:27] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [23:15:27] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [23:15:27] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [23:15:28] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [23:15:37] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [23:15:37] RECOVERY - restbase endpoints health on restbase2008 is 
OK: All endpoints are healthy [23:15:37] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [23:15:38] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [23:15:38] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [23:15:47] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [23:16:25] thanks, it works [23:16:28] off to bed [23:22:07] (03PS1) 10Volans: Transports: improve Command class [software/cumin] - 10https://gerrit.wikimedia.org/r/367823 (https://phabricator.wikimedia.org/T171679) [23:22:09] (03PS1) 10Volans: CLI: add an option to ignore exit codes of commands [software/cumin] - 10https://gerrit.wikimedia.org/r/367824 (https://phabricator.wikimedia.org/T171679) [23:22:11] (03PS1) 10Volans: Transports: improve target management [software/cumin] - 10https://gerrit.wikimedia.org/r/367825 (https://phabricator.wikimedia.org/T171684) [23:22:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:22:57] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:23:58] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:27:43] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3473115 (10Dereckson) **Prioritization** >>! In T168765#3466183, @MF-Warburg wrote: > Langcom has reviewed the concerns and found them unfounded. So plea... 
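The graphite-based 5xx alerts above fire on the fraction of recent datapoints over a limit ("10.00% of data above the critical threshold [1000.0]") and recover below it ("Less than 1.00% above the threshold [250.0]"). A sketch of that arithmetic, ignoring null datapoints as graphite gaps; this is my reading of the alert text, not the real check_graphite code:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above `threshold`,
    matching the 'NN.NN% of data above the critical threshold' wording
    in the HTTP 5xx alerts above. None entries (graphite gaps) are
    ignored; an all-null series counts as 0%."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)
```

With a 10-sample window, a single spike over 1000 req/min yields exactly the 10.00% figure reported in the log.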
[23:27:56] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3473117 (10Dereckson) [23:47:20] (03PS1) 10Chad: WIP: moving update wikiversions to scap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367828 [23:55:40] urandom: you're probably offline, but when you're back, curious what you found out :) [23:55:53] urandom: things look healthy now with those recoveries, just curious what happened (if we know) [23:55:57] greg-g: i'm... kind of at a loss [23:56:06] :) [23:56:07] greg-g: i'm going to have to consult those more in the know here [23:56:20] there was an alert for LVS for mobileapps just before that [23:56:24] mcs people? [23:56:27] * greg-g nods [23:56:31] and the error was for that endpoint [23:56:38] the one that proxies requests to mobileapps [23:56:46] so that is what i suspect [23:57:08] but we had just finished some cassandra restarts, and the logs are kind of poisoned with connection failures [23:57:24] which, they would be, at least to a point... [23:57:46] and then it fixed itself [23:57:59] i always have mixed feelings about that :)